
SQL Server Forum / Other Technologies / Full-Text Search / July 2004


Need help with searching PDF files stored in SQL

Louie - 05 Jul 2004 06:42 GMT
I will explain step by step what I have done; I hope that makes it
easier for the gurus to solve my problem.

Requirement: Be able to search PDF files stored in SQL.
SQL Server: 2000 (SP3)
Windows:    2000 Server

1. Installed Acrobat iFilter 5.0
2. Created a table to store PDFs.
create table PDFFiles
(
FileID int,
PDF image,
DocType char(4),
constraint pk_pdffiles primary key
(
  FileID
)
)

3. Set up the table for fulltext search.
exec sp_fulltext_table 'pdffiles', 'create', 'pdf', 'pk_pdffiles'
exec sp_fulltext_column 'PDFFiles', 'pdf', 'add', default, 'DocType'

4. Insert PDFs into the table.
Done by a custom app I have written. To verify that the PDF was
inserted correctly, I used the app to grab the PDF out and I could
open the file in Acrobat Reader successfully.

The table looks something like this:
FileID  PDF                                          DocType
1       0x255044462D312E330D0A25E2E3CFD30D0A3134...  .pdf

5. Populate the index
exec sp_fulltext_table 'pdffiles', 'start_full'

The population process only took a few seconds. The following is what
event viewer's application log said:

The end of crawl for project <SQLServer$TEST SQL0001200006> has been
detected. The Gatherer successfully processed 2 documents totaling 0K.
It failed to filter 0 documents. 0 URLs could not be reached or were
denied access.

By double-clicking on the PDF catalog in Enterprise Manager, I got the
following info:

Status: Idle
Item Count: 2 (note I only inserted one PDF file)
Catalog Size: 1 MB
Unique Key Count: 737

6. Do a query

select * from pdffiles
where contains(pdf, 'possible')

The query returned nothing. I tried several other keywords but all
have failed.

What have I done wrong or missed?
Thank you.

ps. the pdf file in binary form is attached below

0x255044462D312E330D0A25E2E3CFD30D0A3134352030206F626A0D0A3C3C0D0A2F4C696E656172697A656420310D0A2F4C203537393038300D0A2F48205B203133333420343134205D0D0A2F4F203134370D0A2F452037383438370D0A2F4E2033390D0A2F54203537363035320D0A3E3E0D0A656E646F626A0D0A20202020
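
A quick sanity check on the dump above, as a minimal sketch: decoding the first bytes of the posted hex shows that the stored value does begin with a standard PDF header, which fits the observation that the file opens fine in Acrobat Reader. The function name pdf_version is hypothetical, and HEX_PREFIX is just the first part of the dump copied from the post.

```python
# First bytes of the hex dump from the post (SQL Server shows image data as 0x...).
HEX_PREFIX = "0x255044462D312E330D0A25E2E3CFD30D0A3134352030206F626A0D0A"


def pdf_version(hex_dump):
    """Return the version from a leading '%PDF-x.y' header, or None if absent."""
    text = hex_dump.strip()
    if text.lower().startswith("0x"):
        text = text[2:]
    # The header "%PDF-x.y" is 8 bytes long, i.e. 16 hex digits.
    head = bytes.fromhex(text[:16])
    if len(head) < 8 or not head.startswith(b"%PDF-"):
        return None
    return head[5:8].decode("ascii", errors="replace")


print(pdf_version(HEX_PREFIX))  # prints 1.3
```

A valid header only shows that the bytes were stored intact. It says nothing about whether an IFilter can extract readable text from the file.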
Bob Horkay - 09 Jul 2004 13:20 GMT
I have found this necessary for full-text indexing of PDFs:

Modify the registry to make full-text indexing single-threaded, because the
PDF filter does not support multi-threading. The key is:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Search\1.0\Gathering Manager\
Change the value of RobotThreadsNumber to 1.

There is a KB article on it somewhere, but I've forgotten the
number...

Bob Horkay
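
For anyone who would rather script the change Bob describes than edit it by hand, here is a rough sketch using Python's standard winreg module. The key path and value name come from the post; storing the value as a REG_DWORD is an assumption, and the function name set_single_threaded is hypothetical. It needs administrator rights and only does anything on Windows.

```python
import sys

# Key and value name as given in the post above (under HKEY_LOCAL_MACHINE).
KEY_PATH = r"SOFTWARE\Microsoft\Search\1.0\Gathering Manager"
VALUE_NAME = "RobotThreadsNumber"


def set_single_threaded():
    """Set RobotThreadsNumber to 1 and return the previous value (or None)."""
    if sys.platform != "win32":
        raise OSError("the registry is only available on Windows")
    import winreg  # standard library, but only present on Windows

    access = winreg.KEY_QUERY_VALUE | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, access) as key:
        try:
            old_value, _ = winreg.QueryValueEx(key, VALUE_NAME)
        except FileNotFoundError:
            old_value = None  # the value did not exist yet
        # Assumption: the value is stored as a REG_DWORD.
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, 1)
    return old_value
```

Keeping the returned old value makes it easy to restore the original setting later.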
John Kane - 09 Jul 2004 15:00 GMT
Bob,
While that is sometimes the issue when full-text indexing PDF files with
Adobe's PDF IFilter, it is not the cause of Louie's specific problem in this
thread. He emailed me the PDF file and I used Filtdump to analyze the
content. Because of how the PDF was created, the content is "garbage" to the
PDF IFilter.

Filtdump is a utility in the Platform SDK that dumps and analyzes the
content of a file as seen by its IFilter, in this case Adobe's PDF IFilter.
Louie, I've run this utility against your PDF file (test.pdf) and below is
part of the output:

filtdump -b d:\test.pdf
-- output:

Microsoft Word - GAOG Prospectus Rays 5 Mar working copy.doc
! !"#$%&$ ''()*&+, (-, (
(...'+'+'/"#$%!&''()*'!'!+,-".'&!$''/01''!2!3+!'!! ))(.'#!41'')/5#&"6-!7#!
""8..&"(/.'41&!!,00$9#&''$'&#05#'.''#, 0:  ;
<1<7.=("66('<#=5"66('<..!;!=/"66('<.&;!!>=)""66(????????????????????????????
?????????????????????
!!=566666=@606???????????????????????????????????????????????? A;;#; ;$
!)6666!@606!)6666!)6666! .....

<snip>

While I was able to open this PDF file with Adobe's Acrobat Reader, it
looks to me as though the file was not created with Adobe's PDF creator and
was instead possibly created via MS Word or some other 3rd-party tool, or
was converted improperly from an MS Word doc file.
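
When many PDFs have to be checked, eyeballing filter output like the dump above gets tedious. Here is a rough heuristic, sketched under assumptions: it scores text that has already been captured from a tool such as filtdump by the share of tokens that look like ordinary words. The function names, the regular expression and the 0.5 threshold are all hypothetical choices, not anything Filtdump or the IFilter provides.

```python
import re

# A token counts as "word-like" if it is letters only, optionally followed
# by one punctuation mark (assumption: English-like text).
WORD_RE = re.compile(r"[A-Za-z]+[.,;:!?]?")

# One line of the garbage output quoted in this thread.
GARBAGE = r"""! !"#$%&$ ''()*&+, (-, ("""


def word_ratio(filter_output):
    """Fraction of whitespace-separated tokens that look like ordinary words."""
    tokens = filter_output.split()
    if not tokens:
        return 0.0
    wordlike = [t for t in tokens if WORD_RE.fullmatch(t)]
    return len(wordlike) / len(tokens)


def looks_indexable(filter_output, threshold=0.5):
    """True if enough of the filter output resembles readable text."""
    return word_ratio(filter_output) >= threshold


print(looks_indexable(GARBAGE))  # prints False
```

The first line of the dump is a readable document title, so the score is more telling when it is computed over the whole output rather than the first line alone.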

FYI Bob, the issue you describe is documented in KB article "Q323040 BUG:
SQL Server Full-Text Population by Using a Single-Threaded Filter DLL or a
PDF Filter DLL May Not Succeed" at
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q323040

Regards,
John

Louie - 12 Jul 2004 01:03 GMT
John,

I have tried another PDF (from Acrobat itself) and it worked. I think we
have finally located the source of the problem.

According to your explanation:
"... was converted improperly from a MS Word doc file."

So, is it true that if an MS Word file (or any other file) is properly
converted to PDF using 3rd-party software, it will work?

The reason I am asking is that in my development environment, all PDFs
come from various sources; we don't generate the PDFs ourselves. That
means we need to handle PDFs that are created by software other than
Acrobat.

I am going to do some tests on other PDFs as well, and I will let you
know the outcome.

Thanks again,
Louie
John Kane - 12 Jul 2004 04:22 GMT
You're welcome, Louie,
Whether the PDF file was converted properly or improperly from MS Word (the
header info reads "Microsoft Word - GAOG Prospectus Rays 5 Mar working
copy.doc"), I cannot say, but for some reason the Adobe PDF IFilter was not
able to recognize it as a proper PDF file. You might want to ask Adobe about
this situation.

Either way, one thing you can do is open other problem PDF files with
Notepad, or run them through filtdump.exe, and look for the *correct*
strings in the file or in the filtdump output. Yes, please do let me and
others on this newsgroup know what your research turns up!

Regards,
John

Louie - 26 Jul 2004 06:31 GMT
Eventually, we decided to extract the text from the PDF files and store the
text instead. Since we need to retrieve the PDF files after searching is
done, there is no point storing the actual files twice.
John Kane - 26 Jul 2004 15:32 GMT
Louie,
Thank you for the feedback on what your research turned up and your
solution! Since you're storing the text of the PDF files (and other file
types too) in SQL Server, can I assume you will store only a pointer to the
actual PDF files on disk for retrieval of the files when required?

Thanks again,
John
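
The "extracted text plus a pointer to the file" layout asked about above could look roughly like the sketch below. To keep it runnable anywhere it uses Python's built-in sqlite3 as a stand-in for SQL Server and a plain LIKE match in place of CONTAINS, so it illustrates the table design only, not full-text search. The table name, column names and file path are hypothetical, and the text extraction step itself is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table PDFText (
        FileID   integer primary key,
        FilePath text not null,  -- pointer to the PDF on disk
        Body     text not null   -- text extracted from the PDF
    )
    """
)


def add_document(conn, file_id, file_path, body):
    """Store the extracted text together with the location of the original file."""
    conn.execute(
        "insert into PDFText (FileID, FilePath, Body) values (?, ?, ?)",
        (file_id, file_path, body),
    )


def find_paths(conn, keyword):
    """Return the paths of files whose text contains the keyword (substring match)."""
    rows = conn.execute(
        "select FilePath from PDFText where Body like ? order by FileID",
        ("%" + keyword + "%",),
    )
    return [row[0] for row in rows]


add_document(conn, 1, r"D:\pdfs\test.pdf", "It is possible to search this text.")
print(find_paths(conn, "possible"))
```

A search then returns file paths, and the application fetches the PDF from disk only when the user asks for it, so the file is stored once.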

 
©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.