Bob,
While that is sometimes the issue with the FT Indexing of PDF files using
Adobe's PDF IFilter, that is not the case for Louie's specific PDF file and
his specific problem in this thread. He did email me the PDF file and I used
Filtdump to analyze the content and because of how the PDF was created, the
content is "garbage" to the PDF IFilter.
Filtdump is a utility that is part of the Platform SDK and can dump and
analyze the content of files based upon the IFilter, in this case Adobe's
PDF IFilter.
I've run this utility against your PDF file (test.pdf) and below is a part
of the output:
filtdump -b d:\test.pdf
-- output:
Microsoft Word - GAOG Prospectus Rays 5 Mar working copy.doc
! !"#$%&$ ''()*&+, (-, (
(...'+'+'/"#$%!&''()*'!'!+,-".'&!$''/01''!2!3+!'!! ))(.'#!41'')/5#&"6-!7#!
""8..&"(/.'41&!!,00$9#&''$'#'.''#, 0: ;
<1<7.=("66('<#=5"66('<..!;!=/"66('<.&;!!>=)""66(????????????????????????????
?????????????????????
!!=566666=@606???????????????????????????????????????????????? A;;#; ;$
!)6666!@606!)6666!)6666! .....
<snip>
While I was able to open this pdf file with Adobe's Acrobat PDF reader, it
looks to me that this PDF file was not actually created via Adobe's PDF
Creator and instead was possibly created via MS Word or some other 3rd party
tool or was converted improperly from an MS Word doc file.
FYI, the issue you speak of is doc'ed in KB article "Q323040 BUG: SQL Server
Full-Text Population by Using a Single-Threaded Filter DLL or a PDF Filter
DLL May Not Succeed" at
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q323040
Regards,
John
> I have found this necessary for Full text indexing of PDF's
>
[quoted text clipped - 7 lines]
>
> Bob Horkay
Louie - 12 Jul 2004 01:03 GMT
John,
I have tried another pdf (from Acrobat itself) and it worked. I think we
finally located the source of the problem.
According to your explanation:
"... was converted improperly from a MS Word doc file."
So, is it true that if an MS Word document (or any other file) was properly
converted to pdf using 3rd party software, it would work?
The reason I am asking is that in my development environment, all PDFs
are created/provided by various sources; we don't generate the PDFs
ourselves. That means we need to handle PDFs that are created by
software other than Acrobat.
I am going to do some tests on other PDFs as well, and I will let you
know the outcome.
Thanks again,
Louie
John Kane - 12 Jul 2004 04:22 GMT
You're welcome, Louie,
I cannot say whether the PDF file was properly or "improperly converted"
from MS Word to the PDF format, as the header info (Microsoft Word - GAOG
Prospectus Rays 5 Mar working copy.doc) suggests, but for some reason the
Adobe PDF IFilter was not able to recognize this as a proper PDF file. You
might want to talk to Adobe and ask them about this situation.
Either way, one thing you can do is to open other problem PDF files with
either Notepad or some other utility (filtdump.exe) and look for the
*correct* strings in the file or in the output from filtdump. Yes, please do
let me and others on this newsgroup know what your research turns up!
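
As a rough aid for that kind of check, here is a minimal Python sketch that
scores how word-like a piece of extracted text is. It is only a heuristic and
not something filtdump or the Adobe PDF IFilter provides; the names
word_ratio and looks_like_garbage and the 0.5 threshold are assumptions made
for illustration, and it only recognizes plain ASCII words.

```python
import re

# Hypothetical threshold: tune it against PDFs that are known to index correctly.
MIN_WORD_RATIO = 0.5

# A "word-like" token here is 2+ ASCII letters with at most one trailing
# punctuation mark. This is a deliberately crude, English-only assumption.
WORD_RE = re.compile(r"^[A-Za-z]{2,}[.,;:!?]?$")


def word_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for token in tokens if WORD_RE.match(token))
    return wordlike / len(tokens)


def looks_like_garbage(text: str, min_ratio: float = MIN_WORD_RATIO) -> bool:
    """Flag text whose word-like ratio falls below the chosen threshold."""
    return word_ratio(text) < min_ratio
```

The text to score could come from filtdump output saved to a file. Fed the
first garbage line of the test.pdf output quoted earlier in this thread,
word_ratio gives 0, while ordinary English prose scores close to 1.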
Regards,
John
> John,
>
[quoted text clipped - 20 lines]
Louie - 26 Jul 2004 06:31 GMT
Eventually, we decided to extract text from the pdf files and store the
text instead. Since we need to retrieve the pdf files after searching is
done, there is no point storing the actual files twice.
John Kane - 26 Jul 2004 15:32 GMT
Louie,
Thank you for the feedback on what your research turned up and your
solution! Since you're storing the text of the pdf files (and other file
types too) in SQL Server, can I assume you will store only a pointer to the
actual pdf files on disk for retrieval of the files when required?
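
If so, the layout I have in mind looks roughly like the Python sketch below.
It is an illustration only: it uses the sqlite3 module as a stand-in for SQL
Server and a plain LIKE match as a stand-in for a real full-text query, and
the table name, column names and helper functions are all hypothetical.

```python
import sqlite3


def create_store(conn: sqlite3.Connection) -> None:
    """Create a table holding the extracted text and a pointer to the file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id INTEGER PRIMARY KEY,"
        " file_path TEXT NOT NULL UNIQUE,"  # pointer to the pdf on disk
        " body_text TEXT NOT NULL)"         # text extracted from the pdf
    )


def add_document(conn: sqlite3.Connection, file_path: str, body_text: str) -> int:
    """Store one document's path and extracted text; return the new row id."""
    cur = conn.execute(
        "INSERT INTO documents (file_path, body_text) VALUES (?, ?)",
        (file_path, body_text),
    )
    conn.commit()
    return cur.lastrowid


def search(conn: sqlite3.Connection, term: str) -> list:
    """Return the paths of documents whose text contains the term.

    LIKE is only a placeholder for a full-text query; note that % and _
    inside the term would be treated as wildcards here.
    """
    rows = conn.execute(
        "SELECT file_path FROM documents WHERE body_text LIKE ? ORDER BY id",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]
```

The search returns only file paths, so the pdf itself is read from disk when
it is needed and is never stored in the database.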
Thanks again,
John
> Eventually, we decided to extract text from the pdf files and store the
> text instead. Since we need to retrieve the pdf files after searching is
> done, there is no point storing the actual files twice.