Yes, there is a performance hit while importing blobs, mainly disk and to a
lesser degree cpu. 500 Mgs is a large chunk of data, there is no easy way to
push it in. Any way you could extract the textual content and store it in
the db?
In general you store the blob in the database for manageability reasons. For
performance reasons you store the path in the db and keep the blob in the
file system. Research has indicated that you should store the blobs in a
directory structure maintained by a hash so that you have a large number of
subdirectories. retrieving a blob this way offers best seek performance in
NTFS. Also web servers are optimized for sending files, databases aren't.
You can also send the file to the client asynchronously using an iframe, if
you are retrieving it from the db it would be done serially.

Signature
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
> Hey Hilary,
>
[quoted text clipped - 21 lines]
>
> Rob Heinen
bormesh@gmail.com - 22 Aug 2006 21:19 GMT
Thanks for the information Hilary,
When I'm inserting the blob into the database, I was under the
impression that it had to be split up into a series of smaller files.
If a large file is inserted, then it will require many small chunks. As
we insert, I assume we're going to be using RAM to store the chunks and
then we'll hit back and forth between the file system and the database,
all the while using up our memory. Am I correct in this assumption? I
guess I'm just wondering at the feasiblity of actually inserting a
larger blob (maybe not as huge as 500 mb) on a slower computer
(assuming the computers that we're developing for have an average of
512MB of RAM). Or does the amount of RAM on the client system not
matter, and only the specs of our Database machine matter?
I hope this makes sense... I can give you more information if you need
it.
Thanks,
Rob Heinen
> Yes, there is a performance hit while importing blobs, mainly disk and to a
> lesser degree cpu. 500 Mgs is a large chunk of data, there is no easy way to
[quoted text clipped - 49 lines]
> >
> > Rob Heinen
Hilary Cotter - 23 Aug 2006 00:23 GMT
Older versions of SQL Server required the blob be loaded in chunks, older
versions of ADO required you use the getchunk method, but newer versions
allow you use to stream object to push it in. The client will consume
virtual ram on the host it is running on, not necessarily the SQL Server
box.
There is a cpu and disk IO hit while inserting the blob. I have not
observered a RAM hit.

Signature
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
> Thanks for the information Hilary,
>
[quoted text clipped - 75 lines]
>> >
>> > Rob Heinen
Kevin.w.Berridge@gmail.com - 24 Aug 2006 18:24 GMT
Your first suggestion was: "Any way you could extract the textual
content and store it in
the db?"
We could do this, but we thought it might be better to store the binary
in the database and not have a copy of the binary on the file system
with a copy of the full text in the database and a pointer in the
database to the binary file as well as a mechanism to extract the full
text from the file. What's your take on this?
Also, even inserting the text instead of the binary, if the file is
sufficiently large I can imagine a situation in which inserting the
blob the standard way could cause some very un-needed hard drive
thrashing. (The standard way being reading all the text into a
variable and then supplying that variable as a parameter to an ad hoc
SQL query or Stored Procedure.) In ADODB there was the Stream object,
but there doesn't seem to be an equivalent in ADO.NET. Do you have a
suggestion for a good way to go about this?
Finally, we also wanted to return a sample of the text that matched the
search with our results. I know that this is not possible in SQL
Server, so I'm wondering there might be an entirely different approach
for both storing and searching these documents that we should consider.
Thanks for all your help and feedback!
Kevin
Kevin.w.Berridge@gmail.com - 29 Aug 2006 19:31 GMT
"Research has indicated that you should store the blobs in a
directory structure maintained by a hash so that you have a large
number of
subdirectories."
Could you expound on this a little more? We have decided to continue
storing our documents in the file system. I've been looking for some
sources detailing the kind of directory structure you mention but so
far no luck.
Thanks,
Kevin
Hilary Cotter - 29 Aug 2006 21:50 GMT
Have a look at
http://groups.google.com/group/microsoft.public.win2000.file_system/browse_frm/t
hread/e7b4d86095272427/257d6a0a740f9d6c?lnk=st&q=%22maximum+number+of+files%22+%
22in+a+folder%22+windows&rnum=1&hl=en#257d6a0a740f9d6c
and
http://www.mcse.ms/archive42-2004-12-1284650.html
Basically on previous projects I have worked on we got optimal performance
by having 100 subdirectories off the root, each with 100 subdirectories in
it, each with 100 files. Also you get better performance with file name like
00000filename.jpg than filename00000.jpg.

Signature
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
> "Research has indicated that you should store the blobs in a
> directory structure maintained by a hash so that you have a large
[quoted text clipped - 8 lines]
> Thanks,
> Kevin