Full text search on Chinese or Chinese/English mix?

Xin Chen - 12 Mar 2005 15:39 GMT
I want to use SQL 2005 FT to search web pages I crawled from the web.  A
page can be Chinese, English, or a Chinese/English mix (a Chinese article with
English phrases in it).

First question: which language's word breaker should I choose? Does the
Chinese word breaker make the English content hard to search?

Second question: should I store text in different languages in different
catalogs so that I can choose a specific word breaker for each? If so, how
do I determine what language a web page uses?  Most Chinese and English
web pages use the utf-8 charset, so my program cannot tell from the charset
which language a page is in.  Shouldn't SQL Server figure out
which word breaker to use automatically by examining the bytes of the utf-8
encoding of the text?

Third, what encoding should I use when I insert the content of a web page into
the full-text database: utf-8, gb2312 (Chinese), or Unicode? Does it
matter?

Your inputs are greatly appreciated.
Hilary Cotter - 14 Mar 2005 16:12 GMT
You have to use the ms.locale metatag for this to work: store your documents
in an image or varbinary column, and then query using the LANGUAGE
keyword. The language you assign to the column is irrelevant, as the
language tags in the documents take precedence. Here is an example:

CREATE TABLE blob

(pk INT not null IDENTITY(1,1) CONSTRAINT primarykey PRIMARY KEY,

blob VARBINARY(MAX),

blobtype VARCHAR(10))

GO

CREATE FULLTEXT INDEX ON blob

(blob TYPE COLUMN blobtype LANGUAGE 1033) -- note: LCID 1033 is U.S. English

KEY INDEX PrimaryKey ON catalog_name

GO

-- Note that the HTML documents we are pushing in are tagged with French language metatags.

INSERT INTO blob (blob,blobtype)
VALUES(CONVERT(VARBINARY(256),'<HTML><HEAD><META name="ms.locale" CONTENT="FR"></HEAD><BODY>mangé</BODY></HTML>'),'.htm')

INSERT INTO blob (blob,blobtype)
VALUES(CONVERT(VARBINARY(256),'<HTML><HEAD><META name="ms.locale" CONTENT="FR"></HEAD><BODY>manger</BODY></HTML>'),'.htm')

GO

Querying for all stemmed forms of the French verb manger (to eat).

SELECT * FROM blob WHERE CONTAINS(*, 'FORMSOF(INFLECTIONAL, manger)', LANGUAGE 1036)

--two rows returned.

Signature

Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

Hilary Cotter - 14 Mar 2005 16:16 GMT
Maybe I didn't answer your question too well.

1) It doesn't matter which word breaker you select. For varbinary or image
columns where the documents contain language tags the iFilter
understands (HTML docs tagged with the ms.locale metatag, or Word and other
Office docs), the embedded language tag controls the word breaker used.

2) You don't have to if you are using image or varbinary
columns. For columns of other data types you will.

3) utf-8 should work.
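
Since the approach above depends on each page carrying an ms.locale metatag, the crawler has to add one before the page is stored. What follows is only a rough sketch of how that might look in Python, not something taken from this thread: the function names, the 20% threshold, and the locale values "ZH-CN" and "EN-US" are all assumptions (the example earlier only shows "FR"), so check which values the HTML iFilter actually honours. It decodes the page as utf-8, counts how many letters are CJK ideographs, and inserts the metatag after the opening head tag.

```python
import re

# CJK Unified Ideographs plus Extension A (an assumed, non-exhaustive range).
_CJK = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")
_TAG = re.compile(r"<[^>]+>")
_HEAD = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def guess_locale(text, threshold=0.2):
    """Guess a locale from the share of CJK ideographs among letters.

    The locale strings and the threshold are illustrative assumptions.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "EN-US"
    cjk = sum(1 for ch in letters if _CJK.match(ch))
    return "ZH-CN" if cjk / len(letters) >= threshold else "EN-US"


def tag_page(raw, threshold=0.2):
    """Return the page bytes (utf-8) with an ms.locale metatag added."""
    text = raw.decode("utf-8", errors="replace")
    visible = _TAG.sub(" ", text)  # crude markup removal before counting
    meta = '<META name="ms.locale" CONTENT="%s">' % guess_locale(visible, threshold)
    if _HEAD.search(text):
        # Insert right after the first opening <head> tag.
        tagged = _HEAD.sub(lambda m: m.group(0) + meta, text, count=1)
    else:
        # No <head> found: fall back to prepending the tag.
        tagged = meta + text
    return tagged.encode("utf-8")
```

The bytes returned by tag_page could then go into the varbinary column with '.htm' as the type column value. A mostly Chinese page with a few English phrases would be tagged as Chinese under this heuristic, so whether the English phrases remain searchable still depends on how the Chinese word breaker treats them.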
 
©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.