SQL Server Forum / Other Technologies / Full-Text Search / March 2006

Multilingual content indexing

Radrizzi Gilles - 24 Mar 2006 15:58 GMT
Hi there,
We have a Windows SharePoint Services installation and are indexing its
content. I have a question regarding the index so I thought it might be
better to post here instead of the WSS newsgroups.

Anyway, here we go:
We have thousands of documents with content in multiple languages (e.g.
English and German, English and Portuguese, etc.).
Of course, there are also documents which contain only one language.
Now, when we search for, let's say, "muito", which is in the list of
Portuguese noise words, the index returns some of the documents that
contain this word, but not all of them.
So I was wondering how exactly noise words work. I can understand that
the index wouldn't return Portuguese-only documents, because the word is in
that language's noise word list. But what about multi-language documents?
Does SharePoint/Index Server determine what language a document is in and
then apply the correct noise word file? Or does it always apply the English
noise words (the server is an English installation)?
How is it possible that some documents which contain a given word are
returned, but not all?

I hope that was reasonably clear and that someone can help me.

Regards

Gilles
Hilary Cotter - 24 Mar 2006 16:33 GMT
Basically, some IFilters will respect embedded language tags for some
document types (Word, XML, HTML). These documents may be broken by a word
breaker for a different language than the default one for your server.

The words will be broken according to language rules and stored in your
catalog as such.

Then when you search the default language rules will be applied at query
time (or overridden if you use the language predicate).
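
To make the index-time versus query-time split concrete, here is a toy Python sketch. It is not how SharePoint or Index Server is implemented. It assumes that each document's language tag also selects the noise word list applied at index time, which would explain why "muito" is found in some documents and not others. The names (NOISE_WORDS, index_document, build_catalog, search), the tiny noise word lists, and the sample documents are all made up for illustration.

```python
# Toy model only: real word breakers and noise word files are far richer.
# Assumption: the noise word list is chosen per document language at index
# time, and per query language at query time.
NOISE_WORDS = {
    "en": {"the", "is", "a", "of"},
    "pt": {"muito", "de", "que", "o"},
}


def index_document(text, language):
    """Return the set of terms stored for a document tagged with `language`."""
    noise = NOISE_WORDS.get(language, set())
    return {word for word in text.lower().split() if word not in noise}


def build_catalog(documents):
    """Map doc id -> stored terms; `documents` maps doc id -> (text, language)."""
    return {
        doc_id: index_document(text, language)
        for doc_id, (text, language) in documents.items()
    }


def search(catalog, term, query_language="en"):
    """Return sorted doc ids whose stored terms contain `term`."""
    term = term.lower()
    # A term that is a noise word in the query language is dropped from the query.
    if term in NOISE_WORDS.get(query_language, set()):
        return []
    return sorted(doc_id for doc_id, terms in catalog.items() if term in terms)


documents = {
    "pt_only": ("obrigado muito bom", "pt"),              # tagged Portuguese
    "mixed_tagged_en": ("thank you muito obrigado", "en"),  # mixed, tagged English
    "en_only": ("thank you very much", "en"),
}
catalog = build_catalog(documents)

print(search(catalog, "muito"))        # English query rules
print(search(catalog, "muito", "pt"))  # Portuguese query rules
```

In this model the English-rules query finds only the mixed document that was indexed as English, because "muito" was never stored for the document indexed as Portuguese. The Portuguese-rules query returns nothing, because the term is discarded before the catalog is consulted.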

Consider a Word doc tagged as German. The words will be broken according to
German language rules, so wanderlust would be broken and stored in your
catalog as wanderlust, wandern, and lust.

If you search on wanderlust using the English language options, you will
only get hits on documents that contain wanderlust itself, such as this
one. If you search on lust using the English language options, you will
also get hits to this document. If you search on wanderlust using the
German language options, you will get hits to documents in a variety of
languages containing wanderlust, wandern, or lust.
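
The wanderlust example can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the real German word breaker: the compound table GERMAN_COMPOUNDS and the helpers break_words and matches are hypothetical, and the only compound the table knows is the one from this post.

```python
# Toy model: a "German word breaker" that knows exactly one compound.
GERMAN_COMPOUNDS = {"wanderlust": ["wanderlust", "wandern", "lust"]}


def break_words(text, language):
    """Split text into terms; with German rules, also emit compound parts."""
    terms = []
    for word in text.lower().split():
        if language == "de" and word in GERMAN_COMPOUNDS:
            terms.extend(GERMAN_COMPOUNDS[word])
        else:
            terms.append(word)
    return terms


def matches(doc_text, doc_language, query, query_language):
    """True if any query term (broken with the query language's rules)
    is among the terms stored for the document (broken with its own rules)."""
    stored = set(break_words(doc_text, doc_language))
    wanted = set(break_words(query, query_language))
    return bool(stored & wanted)


german_doc = ("die wanderlust", "de")
english_doc = ("lust for life", "en")

print(matches(*german_doc, "lust", "en"))         # English query hits the German doc
print(matches(*english_doc, "wanderlust", "de"))  # German query hits the English doc
print(matches(*english_doc, "wanderlust", "en"))  # English query does not
```

The point of the sketch is that the document's language decides what gets stored, while the query's language decides what gets looked up, so the two sides can disagree.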

Watch out for false friends/false cognates and wander words (Wanderwörter).

Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.

This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
