SQL Server Forum / Other Technologies / Full-Text Search / January 2005
FreeTextTable Rank
|
|
Thread rating:  |
msnews.microsoft.com - 27 Jan 2005 06:02 GMT Hi. I have a question - does anyone know on what criteria is the rank set by FreeTextTable ? I created some records in the table "titles" containing the words "drink", "much" and "abstinence" and.... well.... I can find no rhyme or reason to the way this rank is established. I got the ranks 293, 266, 266 and 154. The funny thing is that a title like "No *abstinence* for me, please !" has a higher rank than "ABSTINENCE for dummies." with the word in capitals.
So... how exactly is this rank set ? How reliable is it ?
I'm "afraid" to filter the returns (for instance TOP N) and I am "afraid" to set conditions like "where rank >...." (of course, I could always show all the hits and page the datagrid).
Thanks a lot. Alex.
msnews.microsoft.com - 27 Jan 2005 06:29 GMT After some searching, I found this (for SQL2005):
Ranking of FREETEXT
Freetext ranking is based on the OKAPI BM25 ranking formula. Each term in the query is ranked, and the values are summed. Freetext queries will add words to the query via inflectional generation (stemmed forms of the original query terms); these words are treated as separate terms with no special weighting or relationship with the words from which they were generated. Synonyms generated from the Thesaurus feature are treated as separate, equally weighted terms.
Rank = ?[Terms in Query] w ( ( ( k1 + 1 ) tf ) / ( K + tf ) ) * ( ( k3 + 1 ) qtf / ( k3 + qtf ) ) )
Where: w is the Robertson-Sparck Jones weight.
Originally, w is defined as: w = log10 ( ( ( r + 0.5 ) * ( N - n - R + r + 0.5 ) ) / ( ( R - r + 0.5 ) * ( n - r + 0.5 ) ) ) This was simplified to: w = log10 ( ( ( r + 0.5 ) * ( N - R + r + 0.5 ) ) / ( ( R - r + 0.5 ) * ( n - r + 0.5 ) ) )
R is the number of documents marked relevant by a user. This is not implemented in SQL Server 2005 full-text search, and thus is ignored. r is the number of documents marked relevant by a user containing the term. This is not implemented. N is the number of documents with values for the property in the query. n is the number of documents containing the term. K is ( k1 * ( ( 1 - b ) + ( b * dl / avdl ) ) ) dl is the document length, in word occurrences. avdl is the average document length of the property over which the query spans, in word occurrences. k1, b, and k3 are the constants 1.2, 0.75, and 8.0, respectively. tf is the frequency of the term in a specific document. qtf is the frequency of the term in the query.
So now I know :-)))) Anyway, apparently the ranking results have to be taken with a *BIG" grain of salt.
Alex.
John Kane - 27 Jan 2005 07:22 GMT Alex, Yep, so you found (via Google?) the SQL 2005 FTS paper published in Dec 2004. While it documents SQL 2005, I *believe* that as far as the contains and freetext ranking formula's that they hold true for SQL Server 2000 as well. Even so, I'm surprised that for SQL Server 2005 (or for that matter SQL 2000) that relevance feedback (R & r in the formula) was not implemented as there are standard methods using T-SQL and feedback tables that can be use to implement automatic relevance feedback...
Oh, and that *BIG" grain of salt, that you speak of, not necessary... You now know the formula, and with your table's unique non-noise words, you can calculate the rank values from your query by yourself!
Regards, John
 Signature SQL Full Text Search Blog http://spaces.msn.com/members/jtkane/
> After some searching, I found this (for SQL2005): > [quoted text clipped - 39 lines] > > Alex. John Kane - 27 Jan 2005 07:03 GMT Hi Alex, Yes, I do, but it is complex and you need to understand a bit about basic Informational Retrieval theory. The Rank value from FREETEXTTABLE is based upon what is known as OKAPI or BM25 that was developed by Stephen Robertson (http://research.microsoft.com/users/robertson/). You can see some of the formula documented in "SQL Server 2005 Full-Text Search: Internals and Enhancements" at: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sq l2005ftsearch.asp Specifically, under "Ranking of FREETEXT" -
"Freetext ranking is based on the OKAPI BM25 ranking formula. Each term in the query is ranked, and the values are summed. Freetext queries will add words to the query via inflectional generation (stemmed forms of the original query terms); these words are treated as separate terms with no special weighting or relationship with the words from which they were generated. Synonyms generated from the Thesaurus feature are treated as separate, equally weighted terms."...
See also http://wickedsmrt.blogspot.com/2003_05_01_wickedsmrt_archive.html "From MS Newsgroups: However it was my understanding that rank is based on this formula: ? W(i)=(K1+1) ?idf(i)?(K2+1) ?tf(i.j) /( K1?[(1-b)+b?dl (j)/avdl])?K3(tf(i.j))
W(i) - rank from each term in the search phrase idf(i) - iS the inverse document frequency of term i tf(i,j) - is the term frequency for term i, in row j K1,K2,K3 - are constants dl - is row/column length in words AVdl - is the average row/column length length in words"
Another factor is the number of rows and the number of unique non-noise words per row as you must have a statistically significant number of rows (at least 10,000) for the OKAPI BM25 Freetexttable Rank value to be meaningful. How many rows are in your table "titles"? Also keep in mind that the Rank values are specific to your freetext query and primarily useful for ordering of the results. See SQL Server 2000 BOL title "Full-text Search Recommendations" - "What is RANK and how is it determined when used with CONTAINSTABLE and FREETEXTTABLE predicates?..." for more info.
Yes, OKAPI BM25 is very reliable, but complex. Note, that for US English, SQL FTS is case insensitive and "ABSTINENCE" will have the same rank value as "abstinence" with all other factors being equal. It is complex as you can see, but what is your true objective? Could you provide the exact Freetext query with sample data and results as well as the full output of SELECT @@version ?
Hope that helps! John
 Signature SQL Full Text Search Blog http://spaces.msn.com/members/jtkane/
> Hi. I have a question - does anyone know on what criteria is the rank set by > FreeTextTable ? I created some records in the table "titles" containing the [quoted text clipped - 12 lines] > Thanks a lot. > Alex. msnews.microsoft.com - 27 Jan 2005 07:24 GMT Hello, John. Thank you for your reply.
Yes, I understand. Unfortunately, I'm using the pubs db, and I only have about 50 titles in the 'Titles' table. Here are the results of my query: -------------------------------------------------------------------------------------------------------------------------- Rank/Title/Notes/First Name/Last Name
375/We don't drink that much/Essay on abstinence./Napoleon/Borcan
266/Que c'est bon, c'est bon, c'est bon !/Essai sur les joies de boire beaucoup. Abstinents s'abstenir (eng: abstinence)./Pisica/Rindunel
121/No *abstinence* for me, please !/A smoker's paradise. Smoke-smoke-smoke, boy, oh, isn't it a joy ?/Alberta/Curisor
121/ABSTINENCE for dummies./An epicurian's guide./Pupu/Balacarescu --------------------------------------------------------------------------------------------------------------------------
and this is the query, which is correct:
-------------------------------------------------------------------------------------------------------------------------- strSearch = "SELECT " & _ "SearchTable.[Rank], Titles.title as Title, Titles.notes as Notes, Authors.au_fname as [First Name], Authors.au_lname as [Last Name]" & _ "FROM " & _ "FREETEXTTABLE(Titles, *, '" & strText & "') as SearchTable " & _ "INNER JOIN Titles ON SearchTable.[Key] = Titles.title_id " & _ "INNER JOIN TitleAuthor ON Titles.title_id = TitleAuthor.title_id " & _ "INNER JOIN Authors ON TitleAuthor.au_id = Authors.au_id " & _ "ORDER BY SearchTable.[Rank] DESC" -------------------------------------------------------------------------------------------------------------------------- The version is: Microsoft SQL Server 2000 - 8.00.194 (Intel X86) Aug 6 2000 00:57:48 Copyright (c) 1988-2000 Microsoft Corporation Developer Edition on Windows NT 5.1 (Build 2600: Service Pack 2)
Alex.
> Hi Alex, > Yes, I do, but it is complex and you need to understand a bit about basic [quoted text clipped - 72 lines] >> Thanks a lot. >> Alex. John Kane - 27 Jan 2005 07:47 GMT You're welcome, Alex, Yes, I thought so. While the pubs and northwind database tables are good examples for experimenting with SQL FTS queries, they are not large enough to be used effectively with containstable or freetextable and RANK as you need production level table sizes to get meaningful Ranking results from the SQL FTS queries such as the one below.
FYI, I'd recommend that you apply the latest service pack to your SQL Server 2000 (8.00.194) Developer's Edition on WinXP SP2 as the build (194) of SQL Server 2000 that you are using has no service packs applied and you may be open security bugs with this RTM version.
 Signature SQL Full Text Search Blog http://spaces.msn.com/members/jtkane/
> Hello, John. Thank you for your reply. > [quoted text clipped - 48 lines] > > formula documented in "SQL Server 2005 Full-Text Search: Internals and > > Enhancements" at: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sq l2005ftsearch.asp
> > Specifically, under "Ranking of FREETEXT" - > > [quoted text clipped - 63 lines] > >> Thanks a lot. > >> Alex.
|
|
|