SQL Server Forum / Programming / SQL / July 2008
Noise Word and Thesaurus UDF
|
|
Thread rating:  |
Dooza - 25 Jul 2008 16:43 GMT Can anyone help me create a function that will accept a string of words, then remove any that are listed in a noise word table, add any that match in the thesaurus table, and returned to whatever called it?
Cheers,
Steve
--CELKO-- - 25 Jul 2008 17:01 GMT 1) Where is the DDL? 2) Why would you do this in SQL? 3) There are many cheap or free text search packages that will handle this type of problem much better.
Dooza - 25 Jul 2008 17:26 GMT > 1) Where is the DDL? > 2) Why would you do this in SQL? > 3) There are many cheap or free text search packages that will handle > this type of problem much better. Hi Joe, At this moment in time I am trying to get my head around if its actually the right path to take, so there isn't really a DDL to speak of.
What I am trying to do is have a product search in my webshop. There are thousands of products, and at the moment I use an SP with LIKE '%' + @search + '%' and thats just not good enough. I want users to be able to enter any combination of words to help them find what they are looking for.
I also want to use a thesaurus for specific items, for instance, as the products are entertainment technology, someone may type in Gel when they want Filter, or gaffer when they want gaffa, or bulbs when they want lamps. There are many industry specific words that are being searched incorrectly (I record all 0 result searches)
Therefore I think a Stored Procedure that utilises Full Text Search is one way to go. I want to offer the user 3 options:
1. Match all words 2. Match any word 3. Exact match
Here is what I have already:
CREATE PROCEDURE [dbo].[advancedSearch] @search nvarchar(500) = NULL, @p int = 1 AS BEGIN
DECLARE @term nvarchar(500)
SET NOCOUNT ON;
-- @p = 0 means ANY words -- @p = 1 means ALL words -- @p = 2 means EXAXT match
IF @p = 0 -- ANY BEGIN SELECT @term = '"' + @search + '"'
SELECT @term SELECT R.RANK, I.prd_item, I.prd_wdesc, I.prd_wnotes FROM tblProducts2 AS I JOIN FREETEXTTABLE(tblProducts2,prd_wdesc,@term) AS R ON I.prd_item = R.[KEY] ORDER BY R.RANK DESC END
IF @p = 1 -- ALL BEGIN SELECT @term = @search SELECT @term = REPLACE(@search, ' ','" AND "') SELECT @term = '"' + @term + '"'
SELECT @term SELECT R.RANK, I.prd_item, I.prd_wdesc, I.prd_wnotes FROM tblProducts2 AS I JOIN CONTAINSTABLE(tblProducts2,prd_wdesc, @term) AS R ON I.prd_item = R.[KEY] ORDER BY R.RANK DESC END
IF @p = 2 -- EXACT BEGIN SELECT @term = '"' + @search + '"'
SELECT @term SELECT R.RANK, I.prd_item, I.prd_wdesc, I.prd_wnotes FROM tblProducts2 AS I JOIN CONTAINSTABLE(tblProducts2,prd_wdesc, @term) AS R ON I.prd_item = R.[KEY] ORDER BY R.RANK DESC END END
If you, or anyone else, can point me in the right direction I would be really grateful.
Steve
--CELKO-- - 25 Jul 2008 20:05 GMT If you really want this kind of interface, then I would look for a text search tool. But for a specialized market, why not a tree of keywords and pull-downs, like most websites these days? Itis pretty easy to construct an EBay style series of paths like "Shoes -> Women's -> casual -> slip-ons" on the screen with the nested sets model.
Dooza - 28 Jul 2008 10:54 GMT > If you really want this kind of interface, then I would look for a > text search tool. But for a specialized market, why not a tree of > keywords and pull-downs, like most websites these days? Itis pretty > easy to construct an EBay style series of paths like "Shoes -> Women's > -> casual -> slip-ons" on the screen with the nested sets model. From my experience many people prefer to just type in what they want and let the site do all the hard work. We already have a nice category style navigation system which works very well, I just want to add a simple to use search engine to help refine the users search.
We currently have just 12,000 items in our webshop, this is continuously growing, and within the next 6 months it should be around 50,000 items. We want our users to find what they want as quickly as possible, and a search engine is ideal for this.
If I had SQL 2005 this wouldn't be an issue, as the Thesaurus works, but I am limited to SQL 2000, so need to implement my own.
I am working through the examples given by others, to see if I can blend them into the full text search that I already have working.
What I need at the moment is a function that accepts a search string, takes out noise words, and adds thesaurus words. I will then take the output from the function, add quotes around each word, and add "AND" in between them depending on the type of search being carried out.
Steve
--CELKO-- - 28 Jul 2008 15:43 GMT Here is an old "Cut & Paste" of mine for a trick to load tables with criteria and not have to use dynamic SQL:
skill = Java AND (skill = Perl OR skill = PHP)
becomes the disjunctive canonical form:
(Java AND Perl) OR (Java AND PHP)
which we load into this table:
CREATE TABLE Query (and_grp INTEGER NOT NULL, skill CHAR(4) NOT NULL, PRIMARY KEY (and_grp, skill));
INSERT INTO Query VALUES (1, 'Java'); INSERT INTO Query VALUES (1, 'Perl'); INSERT INTO Query VALUES (2, 'Java'); INSERT INTO Query VALUES (2, 'PHP');
Assume we have a table of job candidates:
CREATE TABLE Candidates (candidate_name CHAR(15) NOT NULL, skill CHAR(4) NOT NULL, PRIMARY KEY (candidate_name, skill));
INSERT INTO Candidates VALUES ('John', 'Java'); --winner INSERT INTO Candidates VALUES ('John', 'Perl'); INSERT INTO Candidates VALUES ('Mary', 'Java'); --winner INSERT INTO Candidates VALUES ('Mary', 'PHP'); INSERT INTO Candidates VALUES ('Larry', 'Perl'); --winner INSERT INTO Candidates VALUES ('Larry', 'PHP'); INSERT INTO Candidates VALUES ('Moe', 'Perl'); --winner INSERT INTO Candidates VALUES ('Moe', 'PHP'); INSERT INTO Candidates VALUES ('Moe', 'Java'); INSERT INTO Candidates VALUES ('Celko', 'Java'); -- loser INSERT INTO Candidates VALUES ('Celko', 'Algol'); INSERT INTO Candidates VALUES ('Smith', 'APL'); -- loser INSERT INTO Candidates VALUES ('Smith', 'Algol');
The query is simple now:
SELECT DISTINCT C1.candidate_name FROM Candidates AS C1, Query AS Q1 WHERE C1.skill = Q1.skill GROUP BY Q1.and_grp, C1.candidate_name HAVING COUNT(C1.skill) = (SELECT COUNT(*) FROM Query AS Q2 WHERE Q1.and_grp = Q2.and_grp);
You can retain the COUNT() information to rank candidates. For example Moe meets both qualifications, while other candidates meet only one of the two. You can Google "canonical disjunctive form" for more details. This is a form of relational division.
For a program to convert to look at this pdf and implement the list processing shown on page 6 of the paper in your favorite recursive language.
http://www.ub.ro/pubs/scssm/39_Mihis%20ver-pag493-502.pdf
Eric Russell - 25 Jul 2008 19:09 GMT Here is something I tossed together to illustrate how you might implement a thesaurus and noise keyword table. It is functional and references the ProductDescription table in the AdventureWorks sample database.
declare @userinput as varchar(8000), -- Input parameter @i as int, @w as varchar(8000)
declare @include table (keyword varchar(8000)) declare @exclude table (keyword varchar(8000))
declare @thesaurus table (input varchar(8000), keyword varchar(8000)) insert into @thesaurus (input,keyword) values ('offroad','off-road') insert into @thesaurus (input,keyword) values ('superior','super')
declare @noise table (keyword varchar(8000)) insert into @noise (keyword) values ('and') insert into @noise (keyword) values ('with') insert into @noise (keyword) values ('for')
select @userinput = '+superior;+offroad;+bike;+for;+racing;-aluminum;', @i = 0, @w = ''
-- Parse the string of keywords into two tables; one to include and one to exclude: while @i < len(@userinput) begin select @i = @i + 1 if substring(@userinput,@i,1) <> ';' begin select @w = @w + substring(@userinput,@i,1) end else begin if left(@w,1) = '+' insert into @include (keyword) values (substring(@w,2,8000)) if left(@w,1) = '-' insert into @exclude (keyword) values (substring(@w,2,8000)) select @w = '' end end
-- Remove noise keywords: delete from @include where keyword in (select keyword from @noise) delete from @exclude where keyword in (select keyword from @noise)
-- Substitute keywords with thesausus equivalent: update @include set keyword = T.keyword from @include as K join @thesaurus as T on T.input = K.keyword
update @exclude set keyword = T.keyword from @exclude as K join @thesaurus as T on T.input = K.keyword
-- Select contents of included and excluded keywords: select '@include' as K, * from @include select '@exclude' as K, * from @exclude
-- Query the [ProductDescription] table: select P.ProductDescriptionID, P.Description, count(*) as Hits from AdventureWorks.Production.ProductDescription as P join @include as I on P.Description like '%' + I.keyword + '%' left join @exclude as E on P.Description like '%' + E.keyword + '%' where E.keyword is null group by P.ProductDescriptionID, P.Description
K keyword @include super @include off-road @include bike @include racing
K keyword @exclude aluminum
ProductDescriptionID Description Hits 8 Suitable for any type of riding, on or off-road. Fits any budget. Smooth-shifting with a comfortable ride. 1 64 This bike delivers a high-level of performance on a budget. It is responsive and maneuverable, and offers peace-of-mind when you decide to go off-road. 2 88 For true trail addicts. An extremely durable bike that will go anywhere and keep you in control on challenging terrain - without breaking your budget. 1 168 Top-of-the-line competition mountain bike. Performance-enhancing options include the innovative HL Frame, super-smooth front suspension, and traction for all terrain. 2 170 Suitable for any type of off-road trip. Fits any budget. 1 209 Entry level adult bike; offers a comfortable ride cross-country or down the block. Quick-release hubs and rims. 1 249 Value-priced bike with many features of our top-of-the-line models. Has the same light, stiff frame, and the quick acceleration we're famous for. 1 320 Same technology as all of our Road series bikes, but the frame is sized for a woman. Perfect all-around bike for road or racing. 2 321 Same technology as all of our Road series bikes. Perfect all-around bike for road or racing. 2 337 A true multi-sport bike that offers streamlined riding and a revolutionary design. Aerodynamic design lets you ride with the pros, and the gearing will conquer hilly roads. 1 375 Cross-train, race, or just socialize on a sleek, aerodynamic bike. Advanced seat technology provides comfort all day. 1 376 Cross-train, race, or just socialize on a sleek, aerodynamic bike designed for a woman. Advanced seat technology provides comfort all day. 1 409 Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads. 1 513 All-occasion value bike with our basic comfort and safety features. Offers wider, more stable tires for a ride around town or weekend trip. 1 554 The plush custom saddle keeps you riding all day, and there's plenty of space to add panniers and bike bags to the newly-redesigned carrier. This bike has stability when fully-loaded. 1 613 Superior shifting performance. 1 618 Super rigid spindle. 1 627 All-weather brake pads; provides superior stopping by applying more surface to the rim. 1 697 All-purpose bar for on or off-road. 1 890 Lightweight kevlar racing saddle. Leather. 1 1187 Carries 4 bikes securely; steel construction, fits 2" receiver hitch. 1 1188 Clip-on fenders fit most mountain bikes. 1 1200 Men's 8-panel racing shorts - lycra with an elastic waistband and leg grippers. 1 1201 Perfect all-purpose bike stand for working on your bike at home. Quick-adjusting clamps and steel construction. 1 1205 Short sleeve classic breathable jersey with superior moisture control, front zipper, and 3 back pockets. 1
(25 row(s) affected)
> Can anyone help me create a function that will accept a string of words, > then remove any that are listed in a noise word table, add any that [quoted text clipped - 3 lines] > > Steve shuurai11@gmail.com - 25 Jul 2008 19:58 GMT > Can anyone help me create a function that will accept a string of words, > then remove any that are listed in a noise word table, add any that > match in the thesaurus table, and returned to whatever called it? The following function will take a string of words and split them into a table... the only caveat is that you'll need to create some logic (either in the function or elsewhere) to remove any commas etc. that the user might input.
create function fn_SplitList ( @List varchar(8000) ) returns @ReturnTable table ( value varchar(100) ) as
begin declare @current varchar(100)
while ( charindex(' ', @List, 1) > 0 ) begin if (charindex(' ', @List, 1) > 0 ) begin set @current = SUBSTRING( @List, 1, charindex(' ', @List, 1) -1 ) insert @ReturnTable ( Value ) values ( @current ) end set @List = SUBSTRING( @List, charindex(' ', @List, 1) + 1, LEN(@List) ) end insert @ReturnTable ( Value ) values ( @List ) return end
Once you have this, you can do something like the following to get your result set. I wouldn't bother worrying about removing the noise words; a word either relates to your list or it does not. If it does you'll get a hit and if it doesn't you won't.
declare @products table (product_name varchar(20)) declare @thesaurus table (product_name varchar(20), synonyms varchar(50))
declare @searchTerm varchar(50) set @searchTerm = 'some kind of gel stuff'
set nocount on insert @products values ('lamp') insert @products values ('gaffe') insert @products values ('filter') insert @products values ('string')
insert @thesaurus values ('lamp', 'light bulb') insert @thesaurus values ('gaffe', 'gaffer giraffe') insert @thesaurus values ('filter', 'gel goop') insert @thesaurus values ('string', 'rope yarn filament') set nocount off
select p.product_name from @products p inner join @thesaurus t on p.product_name = t.product_name inner join dbo.fn_splitList( @searchTerm ) s on t.synonyms like '%'+s.value+'%' or t.product_name like '%'+s.value+'%'
Dooza - 28 Jul 2008 11:51 GMT This morning I found this: http://www.dbforums.com/archive/index.php/t-329694.html
It showed an example of how to remove noise words from a string. I was able to create this from it:
[code]CREATE function [dbo].[fn_thesaurus](@str varchar(200)) returns varchar(200) as begin
declare @txt varchar(200) declare @t table(word varchar(50))
--Split search text into words - space separator insert @t(word) select value from dbo.fn_Split(@str,' ')
--insert thesaurus words insert @t(word) select synonyms from tblThesaurus inner join @t on word = tblThesaurus.product_name
--remove original word that matached the thesaurus word delete from @t from @t t join tblThesaurus s (nolock) on t.word = s.product_name
--remove noise words by joining to noise table delete from @t from @t t join tblNoise s (nolock) on t.word=s.noiseword
--reassemble string SELECT @txt = COALESCE(@txt +' ','') + word FROM @t SELECT @txt =@txt
return @txt end[/code]
It takes the string, adds matching thesaurus words, removes the matched word, and removes any noise words.
My main search function will add the quotes and "AND" depending on the search being carried out. This was simpler than I imagined, but since I have never create a function before I thought it was going to be much harder.
Steve
|
|
|