SQL Server Forum / Other Technologies / Full-Text Search / March 2007
What is incremental?
|
|
Thread rating:  |
Ian Boyd - 23 Feb 2007 20:44 GMT Reading the MS article "10 Ways to Optimize SQL Server Full-Text Indexing" http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx
It says regarding a full population:
"MSSearch will issue sp_fulltext_getdata once for every row in your table. So if you have 50 million rows in your database, this procedure will be issued 50 million times.
"On the other hand, if you use an incremental population...this statement will also be issued 50 million times. Why?"
That's exactly what i'd like to know. He continues:
"Because even with an incremental population, MSSearch must figure out exactly which rows have been changed, updated, and deleted."
So why is the incremental populator not getting the id of the rows that have changed after the last timestamp? What is so hard about:
SELECT PrimaryKeyColumn FROM SomeTable WHERE timestamp > 0000123456789
If full-population calls sp_fulltext_getdata for every row in the table, and incremental-population calls sp_fulltext_getdata for every for in the table, what is the difference between full and incremental populating?
And while we're at it, can we please, for the love of everything holy, fix EM so i can manage a full-text catalog from my development. And can be please, for the love of everything holy, include some way to see what populating method and change tracking scheme the full text indexing is using. And can we please, for the love of everything holy, let EM show if FT Search is running on the server. And while we're at it, can be please, for the love of everything holy, let me manage change tracking from EM. And while we're at it, can we please, for the love of everything holy, document this stuff in BOL. And while we're at it, can we please, for the love of everything holy shut me up.
Hilary Cotter - 23 Feb 2007 21:36 GMT Hmmm........ this article looks familiar. Basically the indexer extracts the entire row to look at the value of the timestamp column to determine if it has been modified since the last time it looked at that row. If the timestamp column is modified the full-text indexed columns will be re-indexed.
It also checks to see if new rows are added and then compares all the key and timestamp info returned from all the row with what is in the index to figure out what has been deleted.
The problem with this query SELECT PrimaryKeyColumn FROM SomeTable WHERE timestamp > 0000123456789
is it doesn't detect deletes.
A full-index blindly reinidexes everything. The incremental only indexes the rows which have been modified or newly inserted. So it does more bookkeeping than indexing and can't detect if the update was on a column which is not being full-text indexed.
While this might sound inefficient, it actually is very efficient and reliable.
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
> Reading the MS article "10 Ways to Optimize SQL Server Full-Text Indexing" > http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx [quoted text clipped - 35 lines] > document this stuff in BOL. And while we're at it, can we please, for the > love of everything holy shut me up. Ian Boyd - 26 Feb 2007 22:04 GMT > is it doesn't detect deletes. What if we call sp_fulltext_getdata CatalogID, object_id
Which returns a list of ALL rows in the table, and their timestamps. Any rows in our full-text catalog, but not in the returned list, have been deleted.
Then, we call sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID
only for those rows that that have a new timestamp than our largest.
So we get the best of both worlds. We find deletes, and we only scan rows changed that have changed since our last incremental scan.
Hilary Cotter - 27 Feb 2007 03:20 GMT Yes, you can do this, just keep in mind that each time you fire this proc you reset an internal counter saying this is the highest timestamp, so each time you fire it MSSearch starts thinking that the timestamp you generated by firing this proc manually is its reference point and it can miss modified rows.
Basically when fire sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID yourself it throws away the values, in other words only when MSSearch fires this proc does the row get reindexed.
So, I really don't recommend it.
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> is it doesn't detect deletes. > [quoted text clipped - 12 lines] > So we get the best of both worlds. We find deletes, and we only scan rows > changed that have changed since our last incremental scan. Ian Boyd - 01 Mar 2007 14:54 GMT > Yes, you can do this, just keep in mind that each time you fire this proc > you reset an internal counter saying this is the highest timestamp, so [quoted text clipped - 5 lines] > RowPrimayKeyID yourself it throws away the values, in other words only > when MSSearch fires this proc does the row get reindexed. i wasn't meaning me. i meant why isn't SQL Server doing this since it's faster and more efficient?
Hilary Cotter - 01 Mar 2007 17:33 GMT I believe it is, it first fires the proc to get a list of what has change and updates the time stamp, then it gathers those rows and indexes them, and then it checks to see what has been deleted.
Why do you think it works differently?
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> Yes, you can do this, just keep in mind that each time you fire this proc >> you reset an internal counter saying this is the highest timestamp, so [quoted text clipped - 8 lines] > i wasn't meaning me. i meant why isn't SQL Server doing this since it's > faster and more efficient? Ian Boyd - 03 Mar 2007 05:12 GMT > Why do you think it works differently? From the article "10 Ways to Optimize SQL Server Full-Text Indexing" http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx
"MSSearch will issue sp_fulltext_getdata once for every row in your table. So if you have 50 million rows in your database, this procedure will be issued 50 million times.
"On the other hand, if you use an incremental population...this statement will also be issued 50 million times."
Perhaps it's a typo, and was supposed to read: "On the other hand, if you use an incremental population...this statement will NOT be issued 50 million times, but rather will only be executed once for each modified row in the table. Normally there will be only one or two rows, rather than 50 million."
Hilary Cotter - 05 Mar 2007 13:15 GMT Nope, its not a typo. The row has to be extracted to get the timestamp for that column. Consider an you kick off your indexing at 12:00, it extracts that row at 12:50. Now, suppose that row has modified since 12:00. The indexer has to know whether to index that row or not. The 12:00 row set says it doesn't need to, but the Indexer needs a more up to date value, hence it grabs the most recent timestamp.
Make sense?
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> Why do you think it works differently? > [quoted text clipped - 14 lines] > executed once for each modified row in the table. Normally there will be > only one or two rows, rather than 50 million." Ian Boyd - 05 Mar 2007 19:38 GMT > Nope, its not a typo. The row has to be extracted to get the timestamp for > that column. Consider the indexing kicks off at 12:00. It gets a lit of all primary keys in the table, and that row's timestamp:
LastTimestamp PrimaryKey ========= ======= 30043 1 32364 2 30044 3 ... 32363 49584682 etc.
Any rows not in this list get purged from the FullText catalog.
Next the full-text indexer know that when it last crawled the table, the timestamp was at 32363. So it can search through the above list, and only fetch rows that have a newer timestamp.
Timestamp PrimaryKey ======= ======= 32364 3
And so rather than having to pull the contents of 50million rows, we only pull the contents of one row. That's a speed improvement by a factor of 50 million!
And if a row was modified during the 2 seconds it took to do our incremental scan, we can just scan again - since we've sped up the incremental indexing by a factor of 50 million, we can afford to scan more frequently. Perhaps even as quickly as every 6 or 7 seconds.
> Consider an you kick off your indexing at 12:00, it extracts that row at > 12:50. Now, suppose that row has modified since 12:00. The indexer has to > know whether to index that row or not. The 12:00 row set says it doesn't > need to, but the Indexer needs a more up to date value, hence it grabs the > most recent timestamp. This sounds like a full-scan, where we scan every row no matter what. i'm talking about an incremental full-text catalog build. Why does incremental not do as i've described? Everyone who comes into SQL Server Full-Text Indexing assumes it does an incremental scan - and then cannot figure out why it's scanning the whole table.
If you're trying to tell me that incremental and full do the same thing (i.e. scan every row in the table), then i return to my original question: What is incremental?
Hilary Cotter - 06 Mar 2007 21:05 GMT but the timestamp retrieved is the value of the timestamp when the first list was obtained. It then needs to retrieve the timestamp again to see if it has changed since the first full-list retrieval. I suspect the first full-list is to get the crawl seeds and more importantly to figure out what to delete.
Consider this case. You get the highest value timestamp and compare it with the highest value timestamp from the last run. You only index those rows. What about the rows which have modified while you are indexing this delta. By only doing this delta you miss those ones.
You might want to make your case on connect to see what the office MS answer is.
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> Nope, its not a typo. The row has to be extracted to get the timestamp >> for that column. [quoted text clipped - 45 lines] > (i.e. scan every row in the table), then i return to my original question: > What is incremental? Ian Boyd - 08 Mar 2007 05:32 GMT > but the timestamp retrieved is the value of the timestamp when the first > list was obtained. It then needs to retrieve the timestamp again to see if [quoted text clipped - 6 lines] > rows. What about the rows which have modified while you are indexing this > delta. By only doing this delta you miss those ones. You're describing a solution to a problem that doesn't need to be solved. At the very least full repopulation doesn't solve this problem either.
Consider full-repopulation:
12:00:00pm: Get list of all rows and their timestamps. Only row 37983 has changed 12:00:02pm: Crawl row 1, in case it changed since we got the list 12:00:02pm: Crawl row 2, in case it changed since we got the list .... 12:49:37pm: Crawl row 37983, because we know it changed 12:49:37pm: Row 3 changes - after we've crawled it 12:49:37pm: Ccrawl row 37984, in case it changed since we got the list ... 6:13:31pm: Crawl row 50123456, in case it changed since we got the list 6:13:31pm: Crawl complete.
But we missed the change to row 3. The full-repopulation STILL misses the update. So i'll ask you: "What about the rows which have modified while you are indexing this delta?"
Does the full rebuild algorithm perform endless passes until no rows have changed during the full scan? No, there would never be an end. This is an accepted limitation of the algorithm, and any changes we'll pick up the next time a full repopulation is run. What's more, because a full repopulation is so unnecessarily lengthy, it essentially guarantees that rows will change during the full re-scan - and that they will get missed.
So, my original question: If an increment rebuild is the same as a full rebuild, what's the difference between incremental and full? Or, more simply, "What is incremental?"
Now consider incremental-repopulation as everyone assumes it works:
12:00:00pm: Get list of all rows and their timetamps. Only row 37983 has changed 12:00:02pm: Row 3 changes 12:00:02pm: Crawl row 23983, because it has changed 12:00:02pm: Crawl complete
In both examples we miss a change to a row. But in the ideal incremental rebuild algorithm, the window where that change could have happened is much smaller (on the order of 50 million times smaller). And since we've already accepted that if row changes after we've examined it, then it will just have to get picked up next time the incremental (or full) population is run. Additionally, since the incremental population only took 2 seconds, we can run it much more frequently; getting updates in near realtime.
The difference is: the incremental algorithm is faster, with less load on the server, and row changes making it into the full text catalog faster. At least, that's what every MSSQL dba assumes the incremental repopulation does. Until they begin seeing high load on their server, and they post into m.p.s.f asking what SQL Server is doing.
My collary question is: why isn't SQL Server doing this already? There's the "full" algorithm. Is there any other algorithm? Which brings me to my primary question:
Is there an "incremental" algorithm of some sort? If yes, what does the incremental repopulation algorithm do differently from the "full"? What is incremental?
According to an article on TechNet, there's no difference. So then, why the two names?
> You might want to make your case on connect to see what the offial MS > answer is. i don't know what connect is. Nor do i believe i would get an answer.
Simon Sabin - 08 Mar 2007 20:20 GMT Hello Ian,
Isn't the difference that incremental looks to see if the record has changed and reindex if required. With full it just reindexes the record.
Simon Sabin SQL Server MVP http://sqlblogcasts.com/blogs/simons
>> but the timestamp retrieved is the value of the timestamp when the >> first list was obtained. It then needs to retrieve the timestamp [quoted text clipped - 80 lines] >> > i don't know what connect is. Nor do i believe i would get an answer. Ian Boyd - 08 Mar 2007 21:05 GMT > Isn't the difference that incremental looks to see if the record has > changed and reindex if required. With full it just reindexes the record. i see what he's trying to say.
Incremental offers no performance improvement over full, because it still reads every single row in the database, one by one, not even as a set of rows.
Hilary Cotter - 09 Mar 2007 02:05 GMT Both full and incremental extract each row. Incremental will not index each row, but full will. Still full can be faster than incremental as it doesn't have to do any bookeeping or index maintenance.
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> Isn't the difference that incremental looks to see if the record has >> changed and reindex if required. With full it just reindexes the record. [quoted text clipped - 4 lines] > reads every single row in the database, one by one, not even as a set of > rows. Hilary Cotter - 08 Mar 2007 20:43 GMT let me try to explain this again.
Full population. Every row is extracted and indexed.
Incremental population. Every row is extracted, timestamps are compared with the last timestamp the last full or incremental population ended on and if the timestamp is different the extracted row is reindexed. Rows which are no longer in the table but in the catalog are removed from the catalog. Rows which are not in the catalog are inserted.
Does this make sense? The cost of bringing the entire row is not much more expensive from just bringing the timestamp so its not that inefficient.
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> but the timestamp retrieved is the value of the timestamp when the first >> list was obtained. It then needs to retrieve the timestamp again to see [quoted text clipped - 76 lines] >> answer is. > i don't know what connect is. Nor do i believe i would get an answer. Ian Boyd - 08 Mar 2007 21:03 GMT > Full population. Every row is extracted and indexed. > [quoted text clipped - 3 lines] > which are no longer in the table but in the catalog are removed from the > catalog. Rows which are not in the catalog are inserted. If i understand what you're staying - then someone the SQL team needs a smack.
Both "full" and "incremental" both scan scan every row, one by one, in the database, whether the row has changed or not. Why, in the name of god, would they not have an actual "incremental" scan? How is this not a no-brainer?
> Does this make sense? The cost of bringing the entire row is not much more > expensive from just bringing the timestamp so its not that inefficient. i disagree. Put an index on PK, Timestamp.
Fetching every row in the database, one by one; it's like using a cursor to do everything. There's so much to be gained by using SQL Server to do what it's good at: operate on sets.
My god, they really should add an "incremental" update option.
Simon Sabin - 08 Mar 2007 23:14 GMT Hello Ian,
Thats what change tracking is for.
Simon Sabin SQL Server MVP http://sqlblogcasts.com/blogs/simons
>> Full population. Every row is extracted and indexed. >> [quoted text clipped - 24 lines] > > My god, they really should add an "incremental" update option. Hilary Cotter - 09 Mar 2007 00:47 GMT some of the reasons are that not every table is guaranteed to have a timestamp column on it, nor can Microsoft guarantee that their user's won't update the pk's. Suppose you update a pk on your table. How is the indexer supposed to know what row the data in the catalog belongs to? Microsoft can't count on their users being as sophisticated as you obviously are.
As Simon mentions change tracking is designed to provide real time indexing. Full-populations are designed when the majority of your data changes at one time. Incremental is designed when large portions of your data changes at one time.
Keep in mind that the MSSearch engine provided search services to Exchange Content Indexing, MS Office search, Site Server Search and SharePoint Portal Search. They needed a consistent crawl or population method for all population/crawl types.
In SQL 2005 the crawl mechanism has been totally re-written as a database only indexer. Still incremental population remains as it is the fastest crawl/population type in some cases.
Its the bookkeeping that MSSearch must do in order to reconcile what is in the catalog with what is in the table which makes the extraction all the rows in the base table essential.
This crawl algorithm is also present in Sybase and Oracle last time I looked.
 Signature Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> Full population. Every row is extracted and indexed. >> [quoted text clipped - 23 lines] > > My god, they really should add an "incremental" update option. Ian Boyd - 09 Mar 2007 14:22 GMT > some of the reasons are that not every table is guaranteed to have a > timestamp column on it i'd be fine with a timestamp being required to to FT indexing; like some forms of replication need a "IsUniqueIdentifier" row.
> nor can Microsoft guarantee that their user's won't update the pk's. You upate the PK, but don't forget that the timestamp updates with it. We'll get the "new" values associated with that row next run.
> As Simon mentions change tracking is designed to provide real time > indexing. Full-populations are designed when the majority of your data > changes at one time. Incremental is designed when large portions of your > data changes at one time. But both heavily impact SQL Server to the same degree.
> Keep in mind that the MSSearch engine provided search services to Exchange > Content Indexing, MS Office search, Site Server Search and SharePoint > Portal Search. They needed a consistent crawl or population method for all > population/crawl types. It's an implementation detail that Microsoft SQL Server implements it's FullText search ability as a separate process. Integrate it into the engine, and it's not a problem. If they need to choose a different implementation: i'm fine with that.
> In SQL 2005 the crawl mechanism has been totally re-written as a database > only indexer. Still incremental population remains as it is the fastest > crawl/population type in some cases. If this goes toward my previous statement then that's excellent news.
> Its the bookkeeping that MSSearch must do in order to reconcile what is in > the catalog with what is in the table which makes the extraction all the > rows in the base table essential. See points 1 and 2
> This crawl algorithm is also present in Sybase and Oracle last time I > looked. Which means there is room for improvement and innovation on Microsoft's part to make a superior product.
Hilary Cotter - 09 Mar 2007 15:53 GMT Comments in line.
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> some of the reasons are that not every table is guaranteed to have a >> timestamp column on it > > i'd be fine with a timestamp being required to to FT indexing; like some > forms of replication need a "IsUniqueIdentifier" row. The unique identifier used in replication is used to track which rows have changed in which tables and which servers. It does not have a timestamp component. While you could use it for a timestamp it is too wide.
>> nor can Microsoft guarantee that their user's won't update the pk's. > > You upate the PK, but don't forget that the timestamp updates with it. > We'll get the "new" values associated with that row next run. That's right you update the pk, the timestamp tell the indexer that it has been updated since the last time, but the key information is gone. So last run it had a key value of 1 and the catalog knows that X number of words in the index belong to row 1, but now the value of row 1 is 13424. How does it reconcile this to remove these enteries and update them with the correct new pk value?
>> As Simon mentions change tracking is designed to provide real time >> indexing. Full-populations are designed when the majority of your data >> changes at one time. Incremental is designed when large portions of your >> data changes at one time. > > But both heavily impact SQL Server to the same degree. Not always. The extraction is typically less expensive than the indexing, especially when you are dealing with blobs and Far Eastern content.
>> Keep in mind that the MSSearch engine provided search services to >> Exchange Content Indexing, MS Office search, Site Server Search and [quoted text clipped - 5 lines] > engine, and it's not a problem. If they need to choose a different > implementation: i'm fine with that. In SQL 2005 the indexer is in process.
>> In SQL 2005 the crawl mechanism has been totally re-written as a database >> only indexer. Still incremental population remains as it is the fastest [quoted text clipped - 7 lines] > > See points 1 and 2 Please see my answers to them:)
>> This crawl algorithm is also present in Sybase and Oracle last time I >> looked. > > Which means there is room for improvement and innovation on Microsoft's > part to make a superior product. Which means you better go looking to smack the entire search engine developers, not just the MS ones.
Ian Boyd - 10 Mar 2007 23:08 GMT >> i'd be fine with a timestamp being required to to FT indexing; like some >> forms of replication need a "IsUniqueIdentifier" row. > >While you could use it for a timestamp it is too wide. No, i was saying that's there's already a precident for mandating a column's presence in a table to be able to use a feature. So there's no problem in mandating a timestamp be present in order to use incremental population
> That's right you update the pk, the timestamp tell the indexer that it has > been updated since the last time, but the key information is gone. So last > run it had a key value of 1 and the catalog knows that X number of words > in the index belong to row 1, but now the value of row 1 is 13424. How > does it reconcile this to remove these enteries and update them with the > correct new pk value? Okay, i guess i need to have a *detailed* example to explain how this works.
Assume we want to catalog a simple Customers table
CREATE TABLE Customers ( pk int NOT NULL PRIMARY KEY, Firstname varchar(25), Lastname varchar(25), [timestamp] timestamp)
Now, assume this table has been pre-filled with two rows:
pk Firstname Lastname timestamp 1 Ian Boyd 970001 2 Hillary Cotter 970002
Incremental Population Nº1 ===================== Let's do our "initial" incremental population. We record the last timestamp every time we run the incremental population. Since this is the first time an incremental population has been run, we assume a "LastTimestamp" of zero.
The first step in performing an incremental population is to get every Primary key value, and it's timestamp.
NOTE: Perhaps there is some special stored procedure that can be used to get this list; perhaps being called sp_fulltext_getdata. We won't use that for my algorithm, we're use something more clear - for the purposes of explaining the incremental population algorithm.
> SELECT pk, timestamp FROM Customers pk timestamp 1 970001 2 970002 2 row(s) affected
These are all the rows the table contains. Any entries in our full-text catalog that do not exist in this list need to be removed from the full-text catalog; since they have been deleted from the source database. In our example, this is the first time the full-text indexer has been run on our Customers table, so the full-text catalog is empty.
Next, we scan the recordset, looking for any rows with a timestamp greater than zero. We find two rows:
pk timestamp 1 970001 2 970002
So, there are two rows that need to go into the full-text catalog. The full-text indexer issues a statement to get the contents of the Customers table for primary key value 1.
NOTE: This can be some special SQL Server stored procedure (perhaps named sp_fulltext_getdata) that is used to get the values of columns for a specific row. We're going to do it the obvious way; for the purposes of explaining this algorithm
> SELECT * FROM Customers WHERE pk = 1 pk Firstname Lastname timestamp 1 Ian Boyd 970001 1 row(s) affected
The full-text application now takes those values, and parses them into whatever format it uses and stores them wherever it stores them in the catalog. Now it get's the next row:
> SELECT * FROM Customers WHERE pk=2 pk Firstname Lastname timestamp 2 Hillary Cotter 970002 1 rows(s) affected
And now those values are put into the full-text catalog. And the catalog is up to date.
Now let's say someone modifies a row:
> UPDATE Customers SET Firstname = 'Hilary' WHERE pk = 2 1 rows(s) affected
Incremental Population Nº2 ===================== Now, the incremental population algorithm is started up again. The last timestamp in the table when it was run 97002. It's first step is to find all rows in the table, and their timestamps
> SELECT pk, timestamp FROM Customers pk timestamp 1 970001 2 970003 2 row(s) affected
First we delete any rows from our catalog that are not in the database. Since both rows are still there, we proceed to the next phase. Look through the recordset and find rows that have a timestamp greater than 970002. There is one row that matches:
pk timestamp 2 970003
So the indexer fetches the contents of that row:
> SELECT * FROM Customers WHERE pk=2 pk Firstname Lastname timestamp 2 Hilary Cotter 970002 1 rows(s) affected
And now those values are put into the full-text catalog. And the catalog is up to date.
Now let's say someone updates the primary key in the table
> UPDATE Customers SET pk = 3 WHERE pk = 2 Incremental Population Nº3 ===================== Now, the incremental population algorithm is started up again. The last timestamp in the table when it was run 97003. It's first step is to find all rows in the table, and their timestamps
> SELECT pk, timestamp FROM Customers pk timestamp 1 970001 3 970004 2 row(s) affected
First we delete any rows from our catalog that are not in the database. But now we see that our full-text catalog has data for a row with pk=2, but that row doesn't exist in the database. So rows with pk=2 is purged from the full-text catalog, and we proceed to the next phase. Look through the recordset and find rows that have a timestamp greater than 970003. There is one row that matches:
pk timestamp 3 970004
So the indexer fetches the contents of that row:
> SELECT * FROM Customers WHERE pk=3 pk Firstname Lastname timestamp 3 Hilary Cotter 970004 1 rows(s) affected
And now those values are put into the full-text catalog. And the catalog is up to date.
Now let's say someone is really devious and swaps the PK values of rows 1 and 3
>UPDATE Customers SET pk = 4 WHERE pk = 1 1 rows(s) affected
> UPDATE Customers SETpk = 1 WHERE pk = 3 1 row(s) affected
> UPDATE Customers SET pk = 3 WHERE pk = 4 1 row(s) affected
Incremental Population Nº4 ===================== Now, the incremental population algorithm is started up again. The last timestamp in the table when it was run 97004. It's first step is to find all rows in the table, and their timestamps
> SELECT pk, timestamp FROM Customers pk timestamp 1 970007 3 970006 2 row(s) affected
First we delete any rows from our catalog that are not in the database. All rows in our catalog exist in the database, so we proceed to the next step. The last timestamp when we ran was 970004, so we look for rows that have a newer timestamp, and we find two rows:
pk timestamp 1 970007 3 970006
So the indexer fetches the contents of the the first row:
> SELECT * FROM Customers WHERE pk=1 pk Firstname Lastname timestamp 1 Hilary Cotter 970007 1 row(s) affected
And the catalog is updated with these values. Now we fetch the other modified row:
> SELECT * FROM Customers WHERE pk = 3 pk Firstname Lastname timestamp 3 Ian Boyd 97006 1 rows(s) affected
And now those values are put into the full-text catalog. And the catalog is up to date.
Now lets say someone delete a row:
>DELETE FROM Customers WHERE pk = 3 Incremental Population Nº4 ===================== Now, the incremental population algorithm is started up again. The last timestamp in the table when it was run 97007. It's first step is to find all rows in the table, and their timestamps
> SELECT pk, timestamp FROM Customers pk timestamp 1 970007 1 row(s) affected
First we delete any rows from our catalog that are not in the database. In this case, our full-text catalog as a row with pk=3 that is no longer in the table, so we purge that row from our catalog. Next we look for all rows that have a timestamp newer than 97007. There is none, so the indexing is up to date.
Find for me a case where the algorithm fails
> The extraction is typically less expensive than the indexing, especially > when you are dealing with blobs and Far Eastern content. And it's even less taxing to not query for a row needlessly.
Hilary Cotter - 12 Mar 2007 14:42 GMT Ian you are describing how an incremental population runs with one exception. So the indexer fetches the contents of that row:
SQL Server has to poll the entire table to find out what is in it. It then compares this with what the fulltext catalog has. It then can detect what is deleted.
An update can be considered a delete followed by an insert.
Your method describes how to detect what is new (inserts and updates), but it doesn't describe a method to determine what is deleted.
Can you help me to understand how your method detects deletions?
To quote you "First we delete any rows from our catalog that are not in the database." and "Any entries in our full-text catalog that do not exist in this list need to be removed from the full-text catalog; since they have been deleted from the source database" and "First we delete any rows from our catalog that are not in the database." and again 'First we delete any rows from our catalog that are not in the database. " And one more time "First we delete any rows from our catalog that are not in the database. "
Exactly how does the indexer know what is deleted knowing only what is new?
The algorithm that the crawl mechanism in MSSearch uses is to say send me all the pks and timestamps and I'll compare this list with what is in the catalog.
This way I detect what is new (in a similar manner to what you describe), and also what is to be deleted.
Am I missing something?
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>>> i'd be fine with a timestamp being required to to FT indexing; like some >>> forms of replication need a "IsUniqueIdentifier" row. [quoted text clipped - 244 lines] > > And it's even less taxing to not query for a row needlessly. Ian Boyd - 12 Mar 2007 16:26 GMT > An update can be considered a delete followed by an insert. > > Your method describes how to detect what is new (inserts and updates), but > it doesn't describe a method to determine what is deleted. > > Can you help me to understand how your method detects deletions? Let's say our table and our catalog are currently in sync
Contents on the Customers table: pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 2 Hilary Nathan Cotter 970002
NOTE: i'll now store first and last names in the FirstName field, so that we can see words being broken up. NOTE: i do not know the format that is used to store the full-text catalog; so i'll have to make one up.
Contents of the FullTextCatalog table: Table Key Column keyword Customers 1 Firstname Ian Customers 1 Firstname Alexander Customers 1 Lastname Boyd Customers 2 Firstname Hilary Customers 2 Firstname Nathan Customers 2 Lastname Cotter
And the last time the incremental population ran the timestamp in the Customers table was 970002. The full-text indexer also must store this:
Contents of the Full-text catalog LastTimestamps table: Table LastTimestamp Customers 970002
Everything is setup. Now we delete a row:
>DELETE FROM Customers WHERE pk = 2 1 row(s) affected
Now there is only one row in the customer's table:
>SELECT * FROM Customers pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 1 row(s) affected
Now the incremental indexer comes along. First thing is does it ask for all key-timestamp pairs:
>SELECT pk, timestamp FROM Customers pk timestamp 1 970001 1 row(s) affected
So your question is: How does the indexer know that a row has been deleted? Specifically, how does it know that row 2 has been deleted?
Looking through the full-text catalog, we find all rows that we've previously indexed:
>SELECT DISTINCT(Key) AS pk FROM FullTextCatalog WHERE TableName='Customers' pk 1 2 2 row(s) affected.
Our full-text catalog contains values from two rows. But in the list from the database that we fetched a moment ago there is only one row:
>SELECT pk, timestamp FROM Customers pk timestamp 1 970001 1 row(s) affected
Obviously row 2, that exists in our full-text catalog, has been delete from the source database, and must be deleted from our full-text catalog.
>DELETE FROM FullTextCatalog WHERE Table='Customers' >AND Key = 2 3 row(s) affected
This is how we detect deletes.
Perhaps now you're wanting to contrive a more pessimistic case. Let's take the original values:
pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 2 Hilary Nathan Cotter 970002
and delete Hilary, and switch Ian to pk value #2, resulting in:
pk Firstname Lastname 2 Ian Alexander Boyd
So now we've deleted a row, but also moved another row on top of it. How do we handle this?
Well, starting from: pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 2 Hilary Nathan Cotter 970002
let's do the SQL to make the change:
>DELETE FROM Customers WHERE pk = 2 1 row(s) affected
Giving us: pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001
>UPDATE Customers SET pk = 2 WHERE pk = 1 1 row(s) affected
Giving us: pk Firstname Lastname timestamp 2 Ian Alexander Boyd 970003
Now the incremental indexing starts, and it's first step is to get a list of all pk-timestamp pairs:
>SELECT pk, timestamp FROM Customers pk timestamp 2 970003 1 row(s) affected
Now we know that a row has been deleted, but how does the full-text indexer know? It looks through it's full-text catalog:
>SELECT DISTINCT(Key) AS pk FROM FullTextCatalog WHERE TableName='Customers' pk 1 2 2 row(s) affected.
The catalog has two rows, but the current able only has one. Specifically, row with pk=1 is no longer in the source database, while it is still in our catalog. So let's remove that row from the catalog:
>DELETE FROM FullTextCatalog >WHERE TableName = 'Customers' >AND Key = 1 1 row(s) affected
"Ah-hah!", you say. "How does it know that row 2 is not the same row 2 in the catalog?" From the timestamps. The indexer stores the last timestamp that was in a table the last time in ran:
>SELECT * FROM LastTiemeStamps >WHERE Table = 'Customers' Table LastTimestamp Customers 970002 1 row(s) affected.
So any rows that have a timestamp greater than 970002 need to be re-populated. Seeking through the list in memory that we already retreived (which i'll reproduce here for convience):
pk timestamp 2 970003
indicates that one row has been modified.
So the full-text indexer throws away any data in it's catalog for Key=2, and puts in it's place values from the current row of pk=2.
Makes sense?
Hilary Cotter - 12 Mar 2007 17:23 GMT What you describe is exactly how the indexer works. It gets stores the last highest timestamp form the last run.
It uses this as the basis to find out what is new. It then extracts each pk and timestamp to determine what to delete and by extension what has been updated. It also extracts the content of the row at that time as the expense of this hit is not much more than the cost of getting the timestamp and the pk.
This is what I tried to explain with my statement "Incremental population. Every row is extracted, timestamps are compared with the last timestamp the last full or incremental population ended on and if the timestamp is different the extracted row is reindexed. Rows which are no longer in the table but in the catalog are removed from the catalog. Rows which are not in the catalog are inserted.
Does this make sense? The cost of bringing the entire row is not much more expensive from just bringing the timestamp so its not that inefficient."
Your response was "If i understand what you're staying - then someone the SQL team needs a smack."
Am I missing something? Do I need to be smacked? Did I do a poor job of explaining this?
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> An update can be considered a delete followed by an insert. >> [quoted text clipped - 166 lines] > > Makes sense? Ian Boyd - 12 Mar 2007 19:40 GMT > What you describe is exactly how the indexer works. It gets stores the > last highest timestamp form the last run. [quoted text clipped - 4 lines] > expense of this hit is not much more than the cost of getting the > timestamp and the pk. No, the hit *is* much more. We will, of course, have a covering index on pk, Timestamp
so that when the full-text indexer performs the query
>SELECT pk,timestamp FROM Customers it only has to scan the index. This is much faster than having to scan every row of the database.
After that, once we've found the few rows that have changed, it's much better for the server to only fetch those specific few rows, rather than querying every row in the table one. Hell, i'd prefer it if we didn't have to issue a separate select for each modified row, but if we could use a set-based operation
>SELECT * FROM Customers >WHERE TimeStamp > 970002 Assuming there's less than a few hundred modified rows.
Do you disagree that having a convering index on pk,timestamp and only querying for few dozen modified rows is better than querying for all rows in a table, whether they are used are not - and not even as a set or a batch, but one row at a time?
Assuming a 50,000,000 row table. If the pk is a 4-byte integer, and the timestamp is a 4-byte ordinal, and you create a covering index on those two, you can fit a little over 1000 rows in an 8k page. That's works out to about 50k pages of i/o. If there is then a dozen modified rows, and those rows are randomly scattered around the database, that's 84 more page reads (assuming b-tree traversal is log(n), with 6,250,000 pages), for a total of 50,084 pages of i/o.
On the other hand, using the current SQL Server incremental algorithm, we have to scan every page in a table. Assume each row in the table is 1000 bytes wide. That's about 8 rows per page. That then requires 6,250,000 pages of i/o to get the entire database.
50,084 pages to read vs 6,250,000 pages to read
That's about 125x more i/o generated using the "brute force" method, over than the finesse method.
So, this is where i come to the point in my head asking, "Why do i have to explain this to the SQL Server team?"
> Does this make sense? The cost of bringing the entire row is not much more > expensive from just bringing the timestamp so its not that inefficient." What we need to do is stop bringing in a whole row, just to get the pk and timestamp; and only bring in the whole row when it is absoutely necessary.
Hilary Cotter - 13 Mar 2007 04:35 GMT Microsoft does recommend putting an index on the timestamp column. Their customers need the flexibility to create indexes and tables as they see fit. Best practices might dictate using a covering index in some cases. For example some application might require the dba not modify the table structure in any way.
Rather than going around in circles with me you should go to connect and make your suggestions there and have Microsoft respond to them.
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> What you describe is exactly how the indexer works. It gets stores the >> last highest timestamp form the last run. [quoted text clipped - 60 lines] > What we need to do is stop bringing in a whole row, just to get the pk and > timestamp; and only bring in the whole row when it is absoutely necessary. Ian Boyd - 13 Mar 2007 16:49 GMT > Microsoft does recommend putting an index on the timestamp column. Their > customers need the flexibility to create indexes and tables as they see [quoted text clipped - 4 lines] > Rather than going around in circles with me you should go to connect and > make your suggestions there and have Microsoft respond to them. Where is this place?
Hilary Cotter - 13 Mar 2007 17:23 GMT https://connect.microsoft.com/SQLServer/
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> Microsoft does recommend putting an index on the timestamp column. Their >> customers need the flexibility to create indexes and tables as they see [quoted text clipped - 6 lines] > > Where is this place? Ian Boyd - 22 Mar 2007 18:16 GMT > https://connect.microsoft.com/SQLServer/ Checking on the status of my post, looks like they deleted it. Oh well, maybe SQLServerX
Ian Boyd - 09 Mar 2007 14:27 GMT > some of the reasons are that not every table is guaranteed to have a > timestamp column on it From the BOL, regarding "incremental" population:
Incremental population "...This feature requires that the indexed table have a column of the timestamp data type."
Is there anything i'm missing?
Hilary Cotter - 09 Mar 2007 15:54 GMT That is 100% correct, but I wish I have a cent for every dba who has run an incremental population on a table without a timestamp and had a full population performed.
 Signature Hilary Cotter
Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com
>> some of the reasons are that not every table is guaranteed to have a >> timestamp column on it [quoted text clipped - 6 lines] > > Is there anything i'm missing? Ian Boyd - 11 Mar 2007 01:29 GMT > That is 100% correct, but I wish I have a cent for every dba who has run > an incremental population on a table without a timestamp and had a full > population performed. The interface, or the server, should not allow it to happen. It should be an error. Or perhaps:
| Cannot perform an incremental population on a table without a timestamp. | | Would you like to perform a full population instead? | | Yes No The former is just poor user interface design (which i take it is fixed in SQL2005 Management Studio.
|
|
|