Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
DB Engine
SQL ServerMSDESQL Server CE
Services
Analysis (Data Mining)Analysis (OLAP)DTSIntegration ServicesNotification ServicesReporting Services
Programming
CLRConnectivitySQLXML
Other Technologies
ClusteringEnglish QueryFull-Text SearchReplicationService Broker
General
Data WarehousingPerformanceSecuritySetupSQL Server ToolsOther SQL Server Topics
DirectoryUser Groups
Related Topics
MS AccessOther DB ProductsMS Server Products.NET DevelopmentVB DevelopmentJava DevelopmentMore Topics ...

SQL Server Forum / Other Technologies / Full-Text Search / March 2007

Tip: Looking for answers? Try searching our database.

What is incremental?

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Ian Boyd - 23 Feb 2007 20:44 GMT
Reading the MS article "10 Ways to Optimize SQL Server Full-Text Indexing"
http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx

It says regarding a full population:

   "MSSearch will issue sp_fulltext_getdata once for every row in your
table. So if you have 50 million rows in your database, this procedure will
be issued 50 million times.

   "On the other hand, if you use an incremental population...this
statement will also be issued 50 million times. Why?"

That's exactly what i'd like to know. He continues:

   "Because even with an incremental population, MSSearch must figure out
exactly which rows have been changed, updated, and deleted."

So why is the incremental populator not getting the id of the rows that have
changed after the last timestamp? What is so hard about:

SELECT PrimaryKeyColumn
FROM SomeTable
WHERE timestamp > 0000123456789

If full-population calls sp_fulltext_getdata for every row in the table,
and incremental-population calls sp_fulltext_getdata  for every for in the
table,
what is the difference between full and incremental populating?

And while we're at it, can we please, for the love of everything holy, fix
EM so i can manage a full-text catalog from my development. And can be
please, for the love of everything holy, include some way to see what
populating method and change tracking scheme the full text indexing is
using. And can we please, for the love of everything holy, let EM show if FT
Search is running on the server. And while we're at it, can be please, for
the love of everything holy, let me manage change tracking from EM. And
while we're at it, can we please, for the love of everything holy, document
this stuff in BOL. And while we're at it, can we please, for the love of
everything holy shut me up.
Hilary Cotter - 23 Feb 2007 21:36 GMT
Hmmm........ this article looks familiar. Basically the indexer extracts the
entire row to look at the value of the timestamp column to determine if it
has been modified since the last time it looked at that row. If the
timestamp column is modified the full-text indexed columns will be
re-indexed.

It also checks to see if new rows are added and then compares all the key
and timestamp info returned from all the row with what is in the index to
figure out what has been deleted.

The problem with this query SELECT PrimaryKeyColumn
FROM SomeTable
WHERE timestamp > 0000123456789

is it doesn't detect deletes.

A full-index blindly reinidexes everything. The incremental only indexes the
rows which have been modified or newly inserted.  So it does more
bookkeeping than indexing and can't detect if the update was on a column
which is not being full-text indexed.

While this might sound inefficient, it actually is very efficient and
reliable.

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

> Reading the MS article "10 Ways to Optimize SQL Server Full-Text Indexing"
> http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx
[quoted text clipped - 35 lines]
> document this stuff in BOL. And while we're at it, can we please, for the
> love of everything holy shut me up.
Ian Boyd - 26 Feb 2007 22:04 GMT
> is it doesn't detect deletes.

What if we call
   sp_fulltext_getdata CatalogID, object_id

Which returns a list of ALL rows in the table, and their timestamps. Any
rows in our full-text catalog, but not in the returned list, have been
deleted.

Then, we call
   sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID

only for those rows that that have a new timestamp than our largest.

So we get the best of both worlds. We find deletes, and we only scan rows
changed that have changed since our last incremental scan.
Hilary Cotter - 27 Feb 2007 03:20 GMT
Yes, you can do this, just keep in mind that each time you fire this proc
you reset an internal counter saying this is the highest timestamp, so each
time you fire it MSSearch starts thinking that the timestamp you generated
by firing this proc manually is its reference point and it can miss modified
rows.

Basically when fire sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID
yourself it throws away the values, in other words only when MSSearch fires
this proc does the row get reindexed.

So, I really don't recommend it.

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> is it doesn't detect deletes.
>
[quoted text clipped - 12 lines]
> So we get the best of both worlds. We find deletes, and we only scan rows
> changed that have changed since our last incremental scan.
Ian Boyd - 01 Mar 2007 14:54 GMT
> Yes, you can do this, just keep in mind that each time you fire this proc
> you reset an internal counter saying this is the highest timestamp, so
[quoted text clipped - 5 lines]
> RowPrimayKeyID yourself it throws away the values, in other words only
> when MSSearch fires this proc does the row get reindexed.

i wasn't meaning me. i meant why isn't SQL Server doing this since it's
faster and more efficient?
Hilary Cotter - 01 Mar 2007 17:33 GMT
I believe it is, it first fires the proc to get a list of what has change
and updates the time stamp, then it gathers those rows and indexes them, and
then it checks to see what has been deleted.

Why do you think it works differently?

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> Yes, you can do this, just keep in mind that each time you fire this proc
>> you reset an internal counter saying this is the highest timestamp, so
[quoted text clipped - 8 lines]
> i wasn't meaning me. i meant why isn't SQL Server doing this since it's
> faster and more efficient?
Ian Boyd - 03 Mar 2007 05:12 GMT
> Why do you think it works differently?

From the article "10 Ways to Optimize SQL Server Full-Text Indexing"
http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx

   "MSSearch will issue sp_fulltext_getdata once for every row in your
table. So if you have 50 million rows in your database, this procedure will
be issued 50 million times.

   "On the other hand, if you use an incremental population...this
statement will also be issued 50 million times."

Perhaps it's a typo, and was supposed to read:
   "On the other hand, if you use an incremental population...this
statement will NOT be issued 50 million times, but rather will only be
executed once for each modified row in the table. Normally there will be
only one or two rows, rather than 50 million."
Hilary Cotter - 05 Mar 2007 13:15 GMT
Nope, its not a typo. The row has to be extracted to get the timestamp for
that column. Consider an you kick off your indexing at 12:00, it extracts
that row at 12:50. Now, suppose that row has modified since 12:00. The
indexer has to know whether to index that row or not. The 12:00 row set says
it doesn't need to, but the Indexer needs a more up to date value, hence it
grabs the most recent timestamp.

Make sense?

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> Why do you think it works differently?
>
[quoted text clipped - 14 lines]
> executed once for each modified row in the table. Normally there will be
> only one or two rows, rather than 50 million."
Ian Boyd - 05 Mar 2007 19:38 GMT
> Nope, its not a typo. The row has to be extracted to get the timestamp for
> that column.

Consider the indexing kicks off at 12:00. It gets a lit of all primary keys
in the table, and that row's timestamp:

LastTimestamp    PrimaryKey
=========    =======
30043        1
32364        2
30044        3
...
32363        49584682
etc.

Any rows not in this list get purged from the FullText catalog.

Next the full-text indexer know that when it last crawled the table, the
timestamp was at 32363. So it can search through the above list, and only
fetch rows that have a newer timestamp.

Timestamp    PrimaryKey
=======    =======
32364            3

And so rather than having to pull the contents of 50million rows, we only
pull the contents of one row.  That's a speed improvement by a factor of 50
million!

And if a row was modified during the 2 seconds it took to do our incremental
scan, we can just scan again - since we've sped up the incremental indexing
by a factor of 50 million, we can afford to scan more frequently. Perhaps
even as quickly as every 6 or 7 seconds.

> Consider an you kick off your indexing at 12:00, it extracts that row at
> 12:50. Now, suppose that row has modified since 12:00. The indexer has to
> know whether to index that row or not. The 12:00 row set says it doesn't
> need to, but the Indexer needs a more up to date value, hence it grabs the
> most recent timestamp.

This sounds like a full-scan, where we scan every row no matter what.  i'm
talking about an incremental full-text catalog build.  Why does incremental
not do as i've described? Everyone who comes into SQL Server Full-Text
Indexing assumes it does an incremental scan - and then cannot figure out
why it's scanning the whole table.

If you're trying to tell me that incremental and full do the same thing
(i.e. scan every row in the table), then i return to my original question:
   What is incremental?
Hilary Cotter - 06 Mar 2007 21:05 GMT
but the timestamp retrieved is the value of the timestamp when the first
list was obtained. It then needs to retrieve the timestamp again to see if
it has changed since the first full-list retrieval. I suspect the first
full-list is to get the crawl seeds and more importantly to figure out what
to delete.

Consider this case. You get the highest value timestamp and compare it with
the highest value timestamp from the last run. You only index those rows.
What about the rows which have modified while you are indexing this delta.
By only doing this delta you miss those ones.

You might want to make your case on connect to see what the office MS answer
is.

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> Nope, its not a typo. The row has to be extracted to get the timestamp
>> for that column.
[quoted text clipped - 45 lines]
> (i.e. scan every row in the table), then i return to my original question:
>    What is incremental?
Ian Boyd - 08 Mar 2007 05:32 GMT
> but the timestamp retrieved is the value of the timestamp when the first
> list was obtained. It then needs to retrieve the timestamp again to see if
[quoted text clipped - 6 lines]
> rows. What about the rows which have modified while you are indexing this
> delta. By only doing this delta you miss those ones.

You're describing a solution to a problem that doesn't need to be solved. At
the very least full repopulation doesn't solve this problem either.

Consider full-repopulation:

12:00:00pm: Get list of all rows and their timestamps. Only row 37983 has
changed
12:00:02pm: Crawl row 1, in case it changed since we got the list
12:00:02pm: Crawl row 2, in case it changed since we got the list
....
12:49:37pm: Crawl row 37983, because we know it changed
12:49:37pm: Row 3 changes - after we've crawled it
12:49:37pm: Ccrawl row 37984, in case it changed since we got the list
...
6:13:31pm: Crawl row 50123456, in case it changed since we got the list
6:13:31pm: Crawl complete.

But we missed the change to row 3. The full-repopulation STILL misses the
update. So i'll ask you: "What about the rows which have modified while you
are indexing this delta?"

Does the full rebuild algorithm perform endless passes until no rows have
changed during the full scan? No, there would never be an end. This is an
accepted limitation of the algorithm, and any changes we'll pick up the next
time a full repopulation is run. What's more, because a full repopulation is
so unnecessarily lengthy, it essentially guarantees that rows will change
during the full re-scan - and that they will get missed.

So, my original question: If an increment rebuild is the same as a full
rebuild, what's the difference between incremental and full? Or, more
simply, "What is incremental?"

Now consider incremental-repopulation as everyone assumes it works:

12:00:00pm: Get list of all rows and their timetamps. Only row 37983 has
changed
12:00:02pm: Row 3 changes
12:00:02pm: Crawl row 23983, because it has changed
12:00:02pm: Crawl complete

In both examples we miss a change to a row. But in the ideal incremental
rebuild algorithm, the window where that change could have happened is much
smaller (on the order of 50 million times smaller). And since we've already
accepted that if row changes after we've examined it, then it will just have
to get picked up next time the incremental (or full) population is run.
Additionally, since the incremental population only took 2 seconds, we can
run it much more frequently; getting updates in near realtime.

The difference is: the incremental algorithm is faster, with less load on
the server, and row changes making it into the full text catalog faster. At
least, that's what every MSSQL dba assumes the incremental repopulation
does. Until they begin seeing high load on their server, and they post into
m.p.s.f asking what SQL Server is doing.

My collary question is: why isn't SQL Server doing this already?  There's
the "full" algorithm. Is there any other algorithm?  Which brings me to my
primary question:

Is there an "incremental" algorithm of some sort? If yes, what does the
incremental repopulation algorithm do differently from the "full"? What is
incremental?

According to an article on TechNet, there's no difference. So then, why the
two names?

> You might want to make your case on connect to see what the offial MS
> answer is.
i don't know what connect is. Nor do i believe i would get an answer.
Simon Sabin - 08 Mar 2007 20:20 GMT
Hello Ian,

Isn't the difference that incremental looks to see if the record has changed
and reindex if required. With full it just reindexes the record.

Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons

>> but the timestamp retrieved is the value of the timestamp when the
>> first list was obtained. It then needs to retrieve the timestamp
[quoted text clipped - 80 lines]
>>
> i don't know what connect is. Nor do i believe i would get an answer.
Ian Boyd - 08 Mar 2007 21:05 GMT
> Isn't the difference that incremental looks to see if the record has
> changed and reindex if required. With full it just reindexes the record.

i see what he's trying to say.

Incremental offers no performance improvement over full, because it still
reads every single row in the database, one by one, not even as a set of
rows.
Hilary Cotter - 09 Mar 2007 02:05 GMT
Both full and incremental extract each row. Incremental will not index each
row, but full will. Still full can be faster than incremental as it doesn't
have to do any bookeeping or index maintenance.

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> Isn't the difference that incremental looks to see if the record has
>> changed and reindex if required. With full it just reindexes the record.
[quoted text clipped - 4 lines]
> reads every single row in the database, one by one, not even as a set of
> rows.
Hilary Cotter - 08 Mar 2007 20:43 GMT
let me try to explain this again.

Full population. Every row is extracted and indexed.

Incremental population. Every row is extracted, timestamps are compared with
the last timestamp the last full or incremental population ended on and if
the timestamp is different the extracted row is reindexed. Rows which are no
longer in the table but in the catalog are removed from the catalog. Rows
which are not in the catalog are inserted.

Does this make sense? The cost of bringing the entire row is not much more
expensive from just bringing the timestamp so its not that inefficient.

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> but the timestamp retrieved is the value of the timestamp when the first
>> list was obtained. It then needs to retrieve the timestamp again to see
[quoted text clipped - 76 lines]
>> answer is.
> i don't know what connect is. Nor do i believe i would get an answer.
Ian Boyd - 08 Mar 2007 21:03 GMT
> Full population. Every row is extracted and indexed.
>
[quoted text clipped - 3 lines]
> which are no longer in the table but in the catalog are removed from the
> catalog. Rows which are not in the catalog are inserted.

If i understand what you're staying - then someone the SQL team needs a
smack.

Both "full" and "incremental" both scan scan every row, one by one, in the
database, whether the row has changed or not. Why, in the name of god, would
they not have an actual "incremental" scan? How is this not a no-brainer?

> Does this make sense? The cost of bringing the entire row is not much more
> expensive from just bringing the timestamp so its not that inefficient.

i disagree. Put an index on PK, Timestamp.

Fetching every row in the database, one by one; it's like using a cursor to
do everything. There's so much to be gained by using SQL Server to do what
it's good at: operate on sets.

My god, they really should add an "incremental" update option.
Simon Sabin - 08 Mar 2007 23:14 GMT
Hello Ian,

Thats what change tracking is for.

Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons

>> Full population. Every row is extracted and indexed.
>>
[quoted text clipped - 24 lines]
>
> My god, they really should add an "incremental" update option.
Hilary Cotter - 09 Mar 2007 00:47 GMT
some of the reasons are that not every table is guaranteed to have a
timestamp column on it, nor can Microsoft guarantee that their user's won't
update the pk's. Suppose you update a pk on your table. How is the indexer
supposed to know what row the data in the catalog belongs to? Microsoft
can't count on their users being as sophisticated as you obviously are.

As Simon mentions change tracking is designed to provide real time indexing.
Full-populations are designed when the majority of your data changes at one
time. Incremental is designed when large portions of your data changes at
one time.

Keep in mind that the MSSearch engine provided search services to Exchange
Content Indexing, MS Office search, Site Server Search and SharePoint Portal
Search. They needed a consistent crawl or population method for all
population/crawl types.

In SQL 2005 the crawl mechanism has been totally re-written as a database
only indexer. Still incremental population remains as it is the fastest
crawl/population type in some cases.

Its the bookkeeping that MSSearch must do in order to reconcile what is in
the catalog with what is in the table which makes the extraction all the
rows in the base table essential.

This crawl algorithm is also present in Sybase and Oracle last time I
looked.

Signature

Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> Full population. Every row is extracted and indexed.
>>
[quoted text clipped - 23 lines]
>
> My god, they really should add an "incremental" update option.
Ian Boyd - 09 Mar 2007 14:22 GMT
> some of the reasons are that not every table is guaranteed to have a
> timestamp column on it

i'd be fine with a timestamp being required to to FT indexing; like some
forms of replication need a "IsUniqueIdentifier" row.

> nor can Microsoft guarantee that their user's won't update the pk's.

You upate the PK, but don't forget that the timestamp updates with it. We'll
get the "new" values associated with that row next run.

> As Simon mentions change tracking is designed to provide real time
> indexing. Full-populations are designed when the majority of your data
> changes at one time. Incremental is designed when large portions of your
> data changes at one time.

But both heavily impact SQL Server to the same degree.

> Keep in mind that the MSSearch engine provided search services to Exchange
> Content Indexing, MS Office search, Site Server Search and SharePoint
> Portal Search. They needed a consistent crawl or population method for all
> population/crawl types.

It's an implementation detail that Microsoft SQL Server implements it's
FullText search ability as a separate process. Integrate it into the engine,
and it's not a problem.  If they need to choose a different implementation:
i'm fine with that.

> In SQL 2005 the crawl mechanism has been totally re-written as a database
> only indexer. Still incremental population remains as it is the fastest
> crawl/population type in some cases.

If this goes toward my previous statement then that's excellent news.

> Its the bookkeeping that MSSearch must do in order to reconcile what is in
> the catalog with what is in the table which makes the extraction all the
> rows in the base table essential.

See points 1 and 2

> This crawl algorithm is also present in Sybase and Oracle last time I
> looked.

Which means there is room for improvement and innovation on Microsoft's part
to make a superior product.
Hilary Cotter - 09 Mar 2007 15:53 GMT
Comments in line.
Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> some of the reasons are that not every table is guaranteed to have a
>> timestamp column on it
>
> i'd be fine with a timestamp being required to to FT indexing; like some
> forms of replication need a "IsUniqueIdentifier" row.

The unique identifier used in replication is used to track which rows have
changed in which tables and which servers. It does not have a timestamp
component. While you could use it for a timestamp it is too wide.

>> nor can Microsoft guarantee that their user's won't update the pk's.
>
> You upate the PK, but don't forget that the timestamp updates with it.
> We'll get the "new" values associated with that row next run.

That's right you update the pk, the timestamp tell the indexer that it has
been updated since the last time, but the key information is gone. So last
run it had a key value of 1 and the catalog knows that X number of words in
the index belong to row 1, but now the value of row 1 is 13424. How does it
reconcile this to remove these enteries and update them with the correct new
pk value?

>> As Simon mentions change tracking is designed to provide real time
>> indexing. Full-populations are designed when the majority of your data
>> changes at one time. Incremental is designed when large portions of your
>> data changes at one time.
>
> But both heavily impact SQL Server to the same degree.

Not always. The extraction is typically less expensive than the indexing,
especially when you are dealing with blobs and Far Eastern content.

>> Keep in mind that the MSSearch engine provided search services to
>> Exchange Content Indexing, MS Office search, Site Server Search and
[quoted text clipped - 5 lines]
> engine, and it's not a problem.  If they need to choose a different
> implementation: i'm fine with that.

In SQL 2005 the indexer is in process.

>> In SQL 2005 the crawl mechanism has been totally re-written as a database
>> only indexer. Still incremental population remains as it is the fastest
[quoted text clipped - 7 lines]
>
> See points 1 and 2

Please see my answers to them:)

>> This crawl algorithm is also present in Sybase and Oracle last time I
>> looked.
>
> Which means there is room for improvement and innovation on Microsoft's
> part to make a superior product.

Which means you better go looking to smack the entire search engine
developers, not just the MS ones.
Ian Boyd - 10 Mar 2007 23:08 GMT
>> i'd be fine with a timestamp being required to to FT indexing; like some
>> forms of replication need a "IsUniqueIdentifier" row.
>
>While you could use it for a timestamp it is too wide.

No, i was saying that's there's already a precident for mandating a column's
presence in a table to be able to use a feature. So there's no problem in
mandating a timestamp be present in order to use incremental population

> That's right you update the pk, the timestamp tell the indexer that it has
> been updated since the last time, but the key information is gone. So last
> run it had a key value of 1 and the catalog knows that X number of words
> in the index belong to row 1, but now the value of row 1 is 13424. How
> does it reconcile this to remove these enteries and update them with the
> correct new pk value?

Okay, i guess i need to have a *detailed* example to explain how this works.

Assume we want to catalog a simple Customers table

CREATE TABLE Customers (
   pk int NOT NULL PRIMARY KEY,
   Firstname varchar(25),
   Lastname varchar(25),
   [timestamp] timestamp)

Now, assume this table has been pre-filled with two rows:

pk    Firstname    Lastname    timestamp
1    Ian    Boyd    970001
2    Hillary    Cotter    970002

Incremental Population Nº1
=====================
Let's do our "initial" incremental population. We record the last timestamp
every time we run the incremental population. Since this is the first time
an incremental population has been run, we assume a "LastTimestamp" of zero.

The first step in performing an incremental population is to get every
Primary key value, and it's timestamp.

NOTE: Perhaps there is some special stored procedure that can be used to get
this list; perhaps being called sp_fulltext_getdata. We won't use that for
my algorithm, we're use something more clear - for the purposes of
explaining the incremental population algorithm.

> SELECT pk, timestamp FROM Customers

pk    timestamp
1    970001
2    970002
2 row(s) affected

These are all the rows the table contains. Any entries in our full-text
catalog  that do not exist in this list need to be removed from the
full-text catalog; since they have been deleted from the source database. In
our example, this is the first time the full-text indexer has been run on
our Customers table, so the full-text catalog is empty.

Next, we scan the recordset, looking for any rows with a timestamp greater
than zero. We find two rows:

pk    timestamp
1    970001
2    970002

So, there are two rows that need to go into the full-text catalog. The
full-text indexer issues a statement to get the contents of the Customers
table for primary key value 1.

NOTE: This can be some special SQL Server stored procedure (perhaps named
sp_fulltext_getdata) that is used to get the values of columns for a
specific row. We're going to do it the obvious way; for the purposes of
explaining this algorithm

> SELECT * FROM Customers WHERE pk = 1

pk    Firstname    Lastname    timestamp
1    Ian    Boyd    970001
1 row(s) affected

The full-text application now takes those values, and parses them into
whatever format it uses and stores them wherever it stores them in the
catalog. Now it get's the next row:

> SELECT * FROM Customers WHERE pk=2

pk    Firstname    Lastname    timestamp
2    Hillary    Cotter    970002
1 rows(s) affected

And now those values are put into the full-text catalog. And the catalog is
up to date.

Now let's say someone modifies a row:

> UPDATE Customers SET Firstname = 'Hilary' WHERE pk = 2
1 rows(s) affected

Incremental Population Nº2
=====================
Now, the incremental population algorithm is started up again. The last
timestamp in the table when it was run 97002. It's first step is to find all
rows in the table, and their timestamps

> SELECT pk, timestamp FROM Customers
pk    timestamp
1    970001
2    970003
2 row(s) affected

First we delete any rows from our catalog that are not in the database.
Since both rows are still there, we proceed to the next phase. Look through
the recordset and find rows that have a timestamp greater than 970002. There
is one row that matches:

pk    timestamp
2    970003

So the indexer fetches the contents of that row:

> SELECT * FROM Customers WHERE pk=2

pk    Firstname    Lastname    timestamp
2    Hilary    Cotter    970002
1 rows(s) affected

And now those values are put into the full-text catalog. And the catalog is
up to date.

Now let's say someone updates the primary key in the table

> UPDATE Customers SET pk = 3 WHERE pk = 2

Incremental Population Nº3
=====================
Now, the incremental population algorithm is started up again. The last
timestamp in the table when it was run 97003. It's first step is to find all
rows in the table, and their timestamps

> SELECT pk, timestamp FROM Customers
pk    timestamp
1    970001
3    970004
2 row(s) affected

First we delete any rows from our catalog that are not in the database. But
now we see that our full-text catalog has data for a row with pk=2, but that
row doesn't exist in the database. So rows with pk=2 is purged from the
full-text catalog, and we proceed to the next phase. Look through the
recordset and find rows that have a timestamp greater than 970003. There is
one row that matches:

pk    timestamp
3    970004

So the indexer fetches the contents of that row:

> SELECT * FROM Customers WHERE pk=3

pk    Firstname    Lastname    timestamp
3    Hilary    Cotter    970004
1 rows(s) affected

And now those values are put into the full-text catalog. And the catalog is
up to date.

Now let's say someone is really devious and swaps the PK values of rows 1
and 3

>UPDATE Customers SET pk = 4 WHERE pk = 1
1 rows(s) affected
> UPDATE Customers SETpk = 1 WHERE pk = 3
1 row(s) affected
> UPDATE Customers SET pk = 3 WHERE pk = 4
1 row(s) affected

Incremental Population Nº4
=====================
Now, the incremental population algorithm is started up again. The last
timestamp in the table when it was run 97004. It's first step is to find all
rows in the table, and their timestamps

> SELECT pk, timestamp FROM Customers
pk    timestamp
1    970007
3    970006
2 row(s) affected

First we delete any rows from our catalog that are not in the database. All
rows in our catalog exist in the database, so we proceed to the next step.
The last timestamp when we ran was 970004, so we look for rows that have a
newer timestamp, and we find two rows:

pk    timestamp
1    970007
3    970006

So the indexer fetches the contents of the the first row:

> SELECT * FROM Customers WHERE pk=1

pk    Firstname    Lastname    timestamp
1    Hilary    Cotter    970007
1 row(s) affected

And the catalog is updated with these values. Now we fetch the other
modified row:

> SELECT * FROM Customers WHERE pk = 3

pk    Firstname    Lastname    timestamp
3    Ian    Boyd    97006
1 rows(s) affected

And now those values are put into the full-text catalog. And the catalog is
up to date.

Now lets say someone delete a row:

>DELETE FROM Customers WHERE pk = 3

Incremental Population Nº4
=====================
Now, the incremental population algorithm is started up again. The last
timestamp in the table when it was run 97007. It's first step is to find all
rows in the table, and their timestamps

> SELECT pk, timestamp FROM Customers
pk    timestamp
1    970007
1 row(s) affected

First we delete any rows from our catalog that are not in the database. In
this case, our full-text catalog as a row with pk=3 that is no longer in the
table, so we purge that row from our catalog. Next we look for all rows that
have a timestamp newer than 97007. There is none, so the indexing is up to
date.

Find for me a case where the algorithm fails

> The extraction is typically less expensive than the indexing, especially
> when you are dealing with blobs and Far Eastern content.

And it's even less taxing to not query for a row needlessly.
Hilary Cotter - 12 Mar 2007 14:42 GMT
Ian you are describing how an incremental population runs with one
exception.
So the indexer fetches the contents of that row:

SQL Server has to poll the entire table to find out what is in it. It then
compares this with what the fulltext catalog has. It then can detect what is
deleted.

An update can be considered a delete followed by an insert.

Your method describes how to detect what is new (inserts and updates), but
it doesn't describe a method to determine what is deleted.

Can you help me to understand how your method detects deletions?

To quote you "First we delete any rows from our catalog that are not in the
database." and "Any entries in our full-text catalog  that do not exist in
this list need to be removed from the full-text catalog; since they have
been deleted from the source database" and "First we delete any rows from
our catalog that are not in the database." and again 'First we delete any
rows from our catalog that are not in the database. " And one more time
"First we delete any rows from our catalog that are not in the database. "

Exactly how does the indexer know what is deleted knowing only what is new?

The algorithm that the crawl mechanism in MSSearch uses is to say send me
all the pks and timestamps and I'll compare this list with what is in the
catalog.

This way I detect what is new (in a similar manner to what you describe),
and also what is to be deleted.

Am I missing something?

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>>> i'd be fine with a timestamp being required to to FT indexing; like some
>>> forms of replication need a "IsUniqueIdentifier" row.
[quoted text clipped - 244 lines]
>
> And it's even less taxing to not query for a row needlessly.
Ian Boyd - 12 Mar 2007 16:26 GMT
> An update can be considered a delete followed by an insert.
>
> Your method describes how to detect what is new (inserts and updates), but
> it doesn't describe a method to determine what is deleted.
>
> Can you help me to understand how your method detects deletions?

Let's say our table and our catalog are currently in sync

Contents on the Customers table:
pk    Firstname    Lastname    timestamp
1    Ian Alexander    Boyd    970001
2    Hilary Nathan    Cotter    970002

NOTE: i'll now store first and last names in the FirstName field, so that we
can see words being broken up.
NOTE: i do not know the format that is used to store the full-text catalog;
so i'll have to make one up.

Contents of the FullTextCatalog table:
Table    Key    Column    keyword
Customers    1    Firstname    Ian
Customers    1    Firstname    Alexander
Customers    1    Lastname    Boyd
Customers    2    Firstname    Hilary
Customers    2    Firstname    Nathan
Customers    2    Lastname    Cotter

And the last time the incremental population ran the timestamp in the
Customers table was 970002. The full-text indexer also must store this:

Contents of the Full-text catalog LastTimestamps table:
Table    LastTimestamp
Customers    970002

Everything is setup. Now we delete a row:

>DELETE FROM Customers WHERE pk = 2
1 row(s) affected

Now there is only one row in the customer's table:

>SELECT * FROM Customers
pk    Firstname    Lastname    timestamp
1    Ian Alexander    Boyd    970001
1 row(s) affected

Now the incremental indexer comes along. First thing is does it ask for all
key-timestamp pairs:
>SELECT pk, timestamp FROM Customers
pk    timestamp
1    970001
1 row(s) affected

So your question is: How does the indexer know that a row has been deleted?
Specifically, how does it know that row 2 has been deleted?

Looking through the full-text catalog, we find all rows that we've
previously indexed:

>SELECT DISTINCT(Key) AS pk FROM FullTextCatalog WHERE TableName='Customers'
pk
1
2
2 row(s) affected.

Our full-text catalog contains values from two rows. But in the list from
the database that we fetched a moment ago there is only one row:

>SELECT pk, timestamp FROM Customers
pk    timestamp
1    970001
1 row(s) affected

Obviously row 2, that exists in our full-text catalog, has been delete from
the source database, and must be deleted from our full-text catalog.

>DELETE FROM FullTextCatalog WHERE Table='Customers'
>AND Key = 2
3 row(s) affected

This is how we detect deletes.

Perhaps now you're wanting to contrive a more pessimistic case. Let's take
the original values:

pk    Firstname    Lastname    timestamp
1    Ian Alexander    Boyd    970001
2    Hilary Nathan    Cotter    970002

and delete Hilary, and switch Ian to pk value #2, resulting in:

pk    Firstname    Lastname
2    Ian Alexander    Boyd

So now we've deleted a row, but also moved another row on top of it. How do
we handle this?

Well, starting from:
pk    Firstname    Lastname    timestamp
1    Ian Alexander    Boyd    970001
2    Hilary Nathan    Cotter    970002

let's do the SQL to make the change:

>DELETE FROM Customers WHERE pk = 2
1 row(s) affected

Giving us:
pk    Firstname    Lastname    timestamp
1    Ian Alexander    Boyd    970001

>UPDATE Customers SET pk = 2 WHERE pk = 1
1 row(s) affected

Giving us:
pk    Firstname    Lastname    timestamp
2    Ian Alexander    Boyd    970003

Now the incremental indexing starts, and it's first step is to get a list of
all pk-timestamp pairs:

>SELECT pk, timestamp FROM Customers
pk    timestamp
2    970003
1 row(s) affected

Now we know that a row has been deleted, but how does the full-text indexer
know? It looks through it's full-text catalog:

>SELECT DISTINCT(Key) AS pk FROM FullTextCatalog WHERE TableName='Customers'
pk
1
2
2 row(s) affected.

The catalog has two rows, but the current able only has one. Specifically,
row with pk=1 is no longer in the source database, while it is still in our
catalog. So let's remove that row from the catalog:

>DELETE FROM FullTextCatalog
>WHERE TableName = 'Customers'
>AND Key = 1
1 row(s) affected

"Ah-hah!", you say. "How does it know that row 2 is not the same row 2 in
the catalog?" From the timestamps. The indexer stores the last timestamp
that was in a table the last time in ran:

>SELECT * FROM LastTiemeStamps
>WHERE Table = 'Customers'
Table    LastTimestamp
Customers    970002
1 row(s) affected.

So any rows that have a timestamp greater than 970002 need to be
re-populated. Seeking through the list in memory that we already retreived
(which i'll reproduce here for convience):

pk    timestamp
2    970003

indicates that one row has been modified.

So the full-text indexer throws away any data in it's catalog for Key=2, and
puts in it's place values from the current row of pk=2.

Makes sense?
Hilary Cotter - 12 Mar 2007 17:23 GMT
What you describe is exactly how the indexer works. It gets stores the last
highest timestamp form the last run.

It uses this as the basis to find out what is new. It then extracts each pk
and timestamp to determine what to delete and by extension what has been
updated. It also extracts the content of the row at that time as the expense
of this hit is not much more than the cost of getting the timestamp and the
pk.

This is what I tried to explain with my statement "Incremental population.
Every row is extracted, timestamps are compared with
the last timestamp the last full or incremental population ended on and if
the timestamp is different the extracted row is reindexed. Rows which are no
longer in the table but in the catalog are removed from the catalog. Rows
which are not in the catalog are inserted.

Does this make sense? The cost of bringing the entire row is not much more
expensive from just bringing the timestamp so its not that inefficient."

Your response was "If i understand what you're staying - then someone the
SQL team needs a smack."

Am I missing something? Do I need to be smacked? Did I do a poor job of
explaining this?

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> An update can be considered a delete followed by an insert.
>>
[quoted text clipped - 166 lines]
>
> Makes sense?
Ian Boyd - 12 Mar 2007 19:40 GMT
> What you describe is exactly how the indexer works. It gets stores the
> last highest timestamp form the last run.
[quoted text clipped - 4 lines]
> expense of this hit is not much more than the cost of getting the
> timestamp and the pk.

No, the hit *is* much more. We will, of course, have a covering index on
   pk, Timestamp

so that when the full-text indexer performs the query

>SELECT pk,timestamp FROM Customers

it only has to scan the index. This is much faster than having to scan every
row of the database.

After that, once we've found the few rows that have changed, it's much
better for the server to only fetch those specific few rows, rather than
querying every row in the table one. Hell, i'd prefer it if we didn't have
to issue a separate select for each modified row, but if we could use a
set-based operation

>SELECT * FROM Customers
>WHERE TimeStamp > 970002

Assuming there's less than a few hundred modified rows.

Do you disagree that having a convering index on pk,timestamp and only
querying for few dozen modified rows is better than querying for all rows in
a table, whether they are used are not - and not even as a set or a batch,
but one row at a time?

Assuming a 50,000,000 row table. If the pk is a 4-byte integer, and the
timestamp is a 4-byte ordinal, and you create a covering index on those two,
you can fit a little over 1000 rows in an 8k page. That's works out to about
50k pages of i/o. If there is then a dozen modified rows, and those rows are
randomly scattered around the database, that's 84 more page reads (assuming
b-tree traversal is log(n), with 6,250,000 pages), for a total of 50,084
pages of i/o.

On the other hand, using the current SQL Server incremental algorithm, we
have to scan every page in a table. Assume each row in the table is 1000
bytes wide. That's about 8 rows per page. That then requires 6,250,000 pages
of i/o to get the entire database.

50,084 pages to read
vs
6,250,000 pages to read

That's about 125x more i/o generated using the "brute force" method, over
than the finesse method.

So, this is where i come to the point in my head asking, "Why do i have to
explain this to the SQL Server team?"

> Does this make sense? The cost of bringing the entire row is not much more
> expensive from just bringing the timestamp so its not that inefficient."

What we need to do is stop bringing in a whole row, just to get the pk and
timestamp; and only bring in the whole row when it is absoutely necessary.
Hilary Cotter - 13 Mar 2007 04:35 GMT
Microsoft does recommend putting an index on the timestamp column. Their
customers need the flexibility to create indexes and tables as they see fit.
Best practices might dictate using a covering index in some cases. For
example some application might require the dba not modify the table
structure in any way.

Rather than going around in circles with me you should go to connect and
make your suggestions there and have Microsoft respond to them.

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> What you describe is exactly how the indexer works. It gets stores the
>> last highest timestamp form the last run.
[quoted text clipped - 60 lines]
> What we need to do is stop bringing in a whole row, just to get the pk and
> timestamp; and only bring in the whole row when it is absoutely necessary.
Ian Boyd - 13 Mar 2007 16:49 GMT
> Microsoft does recommend putting an index on the timestamp column. Their
> customers need the flexibility to create indexes and tables as they see
[quoted text clipped - 4 lines]
> Rather than going around in circles with me you should go to connect and
> make your suggestions there and have Microsoft respond to them.

Where is this place?
Hilary Cotter - 13 Mar 2007 17:23 GMT
https://connect.microsoft.com/SQLServer/

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> Microsoft does recommend putting an index on the timestamp column. Their
>> customers need the flexibility to create indexes and tables as they see
[quoted text clipped - 6 lines]
>
> Where is this place?
Ian Boyd - 22 Mar 2007 18:16 GMT
> https://connect.microsoft.com/SQLServer/

Checking on the status of my post, looks like they deleted it.  Oh well,
maybe SQLServerX
Ian Boyd - 09 Mar 2007 14:27 GMT
> some of the reasons are that not every table is guaranteed to have a
> timestamp column on it

From the BOL, regarding "incremental" population:

Incremental population
"...This feature requires that the indexed table have a column of the
timestamp data type."

Is there anything i'm missing?
Hilary Cotter - 09 Mar 2007 15:54 GMT
That is 100% correct, but I wish I have a cent for every dba who has run an
incremental population on a table without a timestamp and had a full
population performed.

Signature

Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

>> some of the reasons are that not every table is guaranteed to have a
>> timestamp column on it
[quoted text clipped - 6 lines]
>
> Is there anything i'm missing?
Ian Boyd - 11 Mar 2007 01:29 GMT
> That is 100% correct, but I wish I have a cent for every dba who has run
> an incremental population on a table without a timestamp and had a full
> population performed.

The interface, or the server, should not allow it to happen. It should be an
error. Or perhaps:

|   Cannot perform an incremental population on a table without a timestamp.
|
|   Would you like to perform a full population instead?
|
|        Yes        No

The former is just poor user interface design (which i take it is fixed in
SQL2005 Management Studio.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.