SQL Server Forum / Other Technologies / Full-Text Search / October 2004
Word breakers and "special" characters
Daniel Crichton - 20 Oct 2004 14:04 GMT I've been digging around trying to find out how to get my FTS implementation to deal with punctuation and "special" characters in a way that fits my needs, but can't find a definitive answer. I'm leaning towards trying to switch to the Neutral Word Breaker setting to see if it will fix it, but don't want to risk messing anything else up, so I thought I'd ask here in case anyone else has found a solution to this issue.
I need to be able to allow searches for words like .net, c#, and c++. It appears that the ., #, and + are treated as word breakers and so are not indexed. At the moment a search using a clause such as CONTAINS(Title,'"c#"') will return all titles that have a word starting with C in them (so it's not even just returning those that have the letter C by itself in the index; it's treating the # as a wildcard). I've also tried escaping the # using CONTAINS(Title,'"c[#]"') with the same result.
Would using the Neutral word breaker help me? Or am I going to have to be a bit more creative and create a "searchable" version of my title field which replaces # with the word sharp, . with the word dot if something follows it without a space, and + with the word plus, and index on that? Then if someone types in c# I would translate this to CONTAINS(SearchTitle,'"csharp"') to get the required results.
Dan
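A minimal sketch of the token replacement described in the post above, written in Python. The thread does not say what language the application uses, so the language and the function names `to_searchable` and `contains_condition` are assumptions made for illustration; only the replacement rules (# to "sharp", + to "plus", and . to "dot" when something follows it without a space) and the `SearchTitle` column come from the post.

```python
import re


def to_searchable(text: str) -> str:
    """Spell out punctuation that the word breaker would otherwise drop."""
    text = text.replace("#", "sharp")   # c#  -> csharp
    text = text.replace("+", "plus")    # c++ -> cplusplus
    # Replace "." only when a non-space character follows it, so a
    # sentence-ending full stop is left alone (.net -> dotnet).
    return re.sub(r"\.(?=\S)", "dot", text)


def contains_condition(user_term: str) -> str:
    """Build the quoted phrase for CONTAINS from what the user typed."""
    phrase = to_searchable(user_term.strip())
    # Drop double quotes so the phrase delimiters stay balanced.
    phrase = phrase.replace('"', "")
    return f'"{phrase}"'


# The same function would be run over each title to fill SearchTitle.
print(to_searchable("Programming C# and C++ with .NET."))
print(contains_condition("c#"))
```

The idea is that the same `to_searchable` function is applied both when populating the searchable title column and when translating the user's input, so the two sides always agree. Note that with these rules "ASP.NET" becomes the single token "ASPdotNET", which a search for "dotnet" may not match as a whole word, so the dot rule might need adjusting for such cases. The resulting phrase should still be passed to the query safely (for example with single quotes escaped) rather than concatenated as-is.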
Daniel Crichton - 20 Oct 2004 14:09 GMT In case it's needed, I'm running SQL Server 7.0 SP2 on Windows 2000 Server SP3.
Dan
Hilary Cotter - 20 Oct 2004 16:56 GMT It might. SQL 2000 does index these words correctly.
In the long run you would be better off trapping these tokens and expanding them to csharp, as you are contemplating doing.
> I've been digging around trying to find out how to get my FTS
> implementation to deal with punctuation and "special" characters in a way that
[quoted text clipped - 19 lines]
> Dan
John Kane - 22 Oct 2004 19:40 GMT Hilary, it is not SQL Server 2000 that indexes these "special" characters incorrectly, but in fact is the OS-supplied wordbreaker dll, in this case for Windows 2000 Server - infosoft.dll as I'm sure you're aware of this fact!
Daniel, you should review the following Google Groups links for some past & very active discussion on this subject: http://groups.google.com/groups?q=langwrbk+infosoft (difference in OS-supplied wordbreakers) and http://groups.google.com/groups?&q=csharp&meta=group%3Dmicrosoft.public.sqlserver.fulltext (for C vs. C++ vs. C# on Win2K vs. WinXP & Win2003).
Enjoy, John
> It might. SQL 2000 does index these words correctly.
[quoted text clipped - 32 lines]
> > Dan
Hilary Cotter - 24 Oct 2004 02:24 GMT Not always. For instance, SQL FTS does an existence check for these files and others during the installation process and installs them if they are missing. This is how you can run SQL FTS on NT Workstation or NT Server installations that do not have Index Server installed.
 Signature Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html
> Hilary, it is not SQL Server 2000 that indexes these "special" characters
> incorrectly, but in fact is the OS-supplied wordbreaker dll, in this case
[quoted text clipped - 5 lines]
> http://groups.google.com/groups?q=langwrbk+infosoft (difference in
> OS-supplied wordbreakers) and http://groups.google.com/groups?&q=csharp&meta=group%3Dmicrosoft.public.sqlserver.fulltext
> (for C vs. C++ vs. C# on Win2K vs. WinXP & Win2003).
[quoted text clipped - 41 lines]
> > > Dan
John Kane - 24 Oct 2004 04:23 GMT Yes, always. I was not referring to "files" (noise word files, such as noise.enu) but I was referring to "special characters", i.e. punctuation characters such as + (plus) or # (pound symbol). These are characters, not noise word files; only the $ (dollar symbol) and _ (underscore, in noise.dat) are included in the noise word files. In truth, these special characters are included in all OS-supplied code pages, and it is the OS-supplied wordbreakers that incorrectly index these special characters.
Best Regards, John
> not always. For instance SQL FTS does an existence check for these files and
> others during the installation process and installs them if they are
[quoted text clipped - 10 lines]
> > http://groups.google.com/groups?q=langwrbk+infosoft (difference in
> > OS-supplied wordbreakers) and http://groups.google.com/groups?&q=csharp&meta=group%3Dmicrosoft.public.sqlserver.fulltext
> > (for C vs. C++ vs. C# on Win2K vs. WinXP & Win2003).
[quoted text clipped - 46 lines]
> > > > Dan
Hilary Cotter - 25 Oct 2004 00:53 GMT I suggest you review this link and point out to me where infosoft.dll ships in NT 4.0 server or workstation.
http://support.microsoft.com/dllhelp/default.aspx?dlltype=file&l=55&alpha=infosoft.dll&S=1&x=4&y=12
And you were referring to the word breaker components, if I might quote you: ", but in fact is the OS-supplied wordbreaker dll, in this case for Windows 2000 Server - infosoft.dll as I'm sure you're aware of this fact!"
My point is that these files are not supplied by the OS in every case; for SQL Server 2000 installed on NT Server and NT Workstation, they are supplied by SQL Server.
I hardly find such quibbling of yours helpful to the community at large.
> Yes, always. I was not referring to "files" (noise word files, such as
> noise.enu) but I was referring to "special characters", i.e. punctuation
[quoted text clipped - 89 lines]
>> > > > Dan
John Kane - 25 Oct 2004 02:48 GMT Hilary, you have missed my point entirely! I have reviewed the below link and it only provides a list of version numbers for the infosoft.dll file. I'm NOT referring to the noise word files but to the "special" characters, and I did suggest that you email me directly and that we take this discussion offline. Why have you not done so?
John
> I suggest you review this link and point out to me where infosoft.dll ships
> in NT 4.0 server or workstation.
> http://support.microsoft.com/dllhelp/default.aspx?dlltype=file&l=55&alpha=infosoft.dll&S=1&x=4&y=12
> And you were referring to the word breaker components, if I might quote you
> ", but in fact is the OS-supplied wordbreaker dll, in this case
[quoted text clipped - 39 lines]
> >> > http://groups.google.com/groups?q=langwrbk+infosoft (difference in
> >> > OS-supplied wordbreakers) and http://groups.google.com/groups?&q=csharp&meta=group%3Dmicrosoft.public.sqlserver.fulltext
> >> > (for C vs. C++ vs. C# on Win2K vs. WinXP & Win2003).
[quoted text clipped - 55 lines]
> >> > > > Dan
Hilary Cotter - 25 Oct 2004 12:29 GMT Exactly where is this invitation to take this discussion offline? Let me quote you once again, and then please quote where you posted this non-existent invitation:
"and I did suggest that you email me directly and that we take this discussion offline. Why have you not done so?"
I am questioning your response about the "but in fact is the OS-supplied wordbreaker dll, in this case for Windows 2000 Server - infosoft.dll as I'm sure you're aware of this fact!" to quote you.
This is not always supplied by the OS.
> Hilary, you have missed my point entirely! I have reviewed the below link
> and it only provides a list of version numbers for the infosoft.dll file.
[quoted text clipped - 7 lines]
> > ships in NT 4.0 server or workstation.
> > http://support.microsoft.com/dllhelp/default.aspx?dlltype=file&l=55&alpha=infosoft.dll&S=1&x=4&y=12
> > And you were referring to the word breaker components, if I might quote you
[quoted text clipped - 45 lines]
> > >> > http://groups.google.com/groups?q=langwrbk+infosoft (difference in
> > >> > OS-supplied wordbreakers) and http://groups.google.com/groups?&q=csharp&meta=group%3Dmicrosoft.public.sqlserver.fulltext
> > >> > (for C vs. C++ vs. C# on Win2K vs. WinXP & Win2003).
[quoted text clipped - 62 lines]
> > >> > > > Dan
Kent Tegels (MVP) - 25 Oct 2004 22:07 GMT Okay you two, I'd hate to see a "FullTextSearch Celebrity Death Match" have to take place to settle this one. If you need a mediator, I'll step up. Otherwise, how about agreeing to disagree and moving on?
Thanks, Kent Tegels MVP - SQL Server
The SSX FAQ & Blog: http://tinyurl.com/6r4gb Looking for XM, the GUI for SSX? See both: http://tinyurl.com/4dfee and http://tinyurl.com/53hts My Blog: http://www.tegels.org/
John Kane - 26 Oct 2004 01:36 GMT Thank you, Kent, I've already posted the following in another thread (subject: Re: Filter Html tags on Full text Search ):
"Ok, let it be noted that I tried to contact you [Hilary] and you did not reply and you have requested that these (un-related) discussions be continued in the online forum as I do not believe that they contribute to the community. Don't be surprised that I disagree with you and question your responses as lately they have been lacking in technical content."
I have repeatedly asked Hilary to take this offline, but he has refused. I've also cc'ed Stephen on the above reply. I do appreciate your efforts as I do think that Hilary is being un-reasonable and I'm more than willing to take this offline with you or anyone else. Lately, Hilary's replies seem to be incorrect and not at the technical level of what I have come to expect from a SQL MVP.
Best regards, John
> Okay you two, I'd hate to see a "FullTextSearch Celebrity Death Match" have to take place to settle this one. If you need a mediator, I'll step up. Otherwise, how about agreeing to disagree and moving on?
> Thanks,
> Kent Tegels
[quoted text clipped - 6 lines]
> My Blog:
> http://www.tegels.org/
Kent Tegels (MVP) - 26 Oct 2004 04:35 GMT
> Thank you, Kent,
> I've already posted the following in another thread (subject: Re: Filter
> Html tags on Full text Search ):
Well, I don't really care about spilled milk. I have no doubt that both of you have things to say that contribute value. Rehashing the past doesn't, IMHO.
> I have repeatedly asked Hilary to take this offline, but he has refused.
Well, again, if you'd like, I'm happy to act as the go-between. You have my address. I'd also be happy to see it resolved. I'll encourage you again to contact me to chat about the topic.
> I do appreciate your efforts as I do think that Hilary is being
> un-reasonable and I'm more than willing to take this offline with you or
> anyone else.
I can't say either of you are being reasonable or unreasonable; I'm just looking to get the noise-to-signal ratio down so I can learn something. :)
> Lately, Hilary's replies seem to be incorrect and not at the
> technical level of what I have come to expect from a SQL MVP.
Well, all I can say is I'm glad that I wasn't held to such a high standard when I got mine. I'm so hyperfocused that 99% of the conversations on this list blister right by me. Even in my own area, I'm wrong plenty of times, but thankfully, folks are there to point that out in a way that helps everybody.
Thanks, Kent Tegels MVP - SQL Server
John Kane - 26 Oct 2004 05:40 GMT You're welcome, Kent. In the final analysis, these newsgroups are about answering and resolving questions & problems for Microsoft customers and users of SQL Server. I helped establish this newsgroup while I was at Microsoft back in 2000 and have been posting in it now for many years (and a few others too). Daniel (who started this thread) did thank me for my answer & links that I provided: "Thanks for those links. It's hard to find things about C# using Google Groups as the # is ignored :\". So, for now this particular thread is done.
What is past is past, and I'm ok with that. While I'm not yet a SQL MVP, I do have respect for the program and understand the high level of knowledge and commitment that is necessary to gain the award, as you have shown in your own efforts. In the end, it's about helping and resolving issues; it's just that lately Hilary doesn't seem to be as focused on FTS as he is on Replication, and his recent postings from the past 6 months show a lack of the level of knowledge & commitment that he once had, I'm sad to say.... Why he did not want to take this offline, I cannot say, but I'm willing to discuss it offline with you or anyone else.
SQL FTS is a niche part of SQL Server and while it does have its problems and nuances, I am glad that you are "listening in" so you and other MVPs can learn a thing or two!
Stay in touch, John
> > Thank you, Kent,
> > I've already posted the following in another thread (subject: Re: Filter
> > Html tags on Full text Search ):
> Well, I don't really care about spilled milk. I have no doubt that both of you have things to say that contribute value. Rehashing the past doesn't, IMHO.
> > I have repeatedly asked Hilary to take this offline, but he has refused.
> Well, again, if you'd like, I'm happy to act as the go-between. You have my address. I'd also be happy to see it resolved. I'll encourage you again to contact me to chat about the topic.
> > I do appreciate your efforts as I do think that Hilary is being
> > un-reasonable and I'm more than willing to take this offline with you or
[quoted text clipped - 3 lines]
> > technical level of what I have come to expect from a SQL MVP.
> Well, all I can say is I'm glad that I wasn't held to such a high standard when I got mine. I'm so hyperfocused that 99% of the conversations on this list blister right by me. Even in my own area, I'm wrong plenty of times, but thankfully, folks are there to point that out in a way that helps everybody.
> Thanks,
> Kent Tegels
[quoted text clipped - 6 lines]
> My Blog:
> http://www.tegels.org/
Daniel Crichton - 26 Oct 2004 09:55 GMT
> have been posting in it now for many years (and a few others too). Daniel
> (who started this thread) did thank me for my answer & links that I provided:
> "Thanks for those links. It's hard to find things about C# using Google
> Groups as the # is ignored :\". So, for now this particular thread is
> done.
I just wanted to also point out, however, that I have now implemented the token replacement that I asked about and that Hilary concurred with, as the changes in W2K SP3 for the handling of # and ++ do not help with the other requirement in my original post that it deal with .Net correctly too.
Dan
Kent Tegels (MVP) - 26 Oct 2004 13:56 GMT John, When I offer to take something off-line, I don't expect to see the follow-up in the group...
Thanks, Kent Tegels MVP - SQL Server
Daniel Crichton - 25 Oct 2004 09:35 GMT
> Hilary, it is not SQL Server 2000 that indexes these "special" characters
> incorrectly, but in fact is the OS-supplied wordbreaker dll, in this case
[quoted text clipped - 7 lines]
> http://groups.google.com/groups?&q=csharp&meta=group%3Dmicrosoft.public.sqlserver.fulltext
> (for C vs. C++ vs. C# on Win2K vs. WinXP & Win2003).
Thanks for those links. It's hard to find things about C# using Google Groups as the # is ignored :\
However, this isn't going to solve my issue. It'll fix the C# and C++ part, but .Net isn't going to be solved. It looks like I might as well do the word replacement for everything.
Dan