On 28/6/2006 3:17 AM, "Kevin van Haaren" <kevin vanhaaren.net> spake thus:
> I don't want to talk about WinFS specifically but is anyone aware of a
> filesystem that actually stores the file data in the same database as the
> metadata? Does anyone consider this a good idea?
>
> Files are variable things. I probably have files ranging from 1kb to 4GB, I
> don't see how a database could be optimized to deal with records of
> variable sizes very well (I know they can do them, I just don't think
> they're good at them.)

I can't speak for actual examples, but I can at least attack this from the
database side (more accurately, I can tell you how Oracle does it :). I
don't see any particular problem with storing the files in the database as
well, as long as the database engine is efficient. Efficiency was the big
problem with the original Be file system, IIRC, and why they dropped the
database idea in later versions. I was a little sad that they did so.

Almost every DBMS has to deal with variable size records, but in many cases
the variability in size is probably fairly small. If you're talking typical
business data with lots of text and numbers, then you're probably only
looking at a few tens of bytes per record anyway. However, there are plenty
of databases out there now with more "interesting" data in them such as
images and audio, which can cause dramatic variation in record sizes from
one table to the next.

Oracle (and I'm fairly certain a large number of other products) deals with
the issue of variable record size by ignoring it. OK, that's an
over-simplification :) It sets a fixed database block size (as distinct
from the disk block or OS block or record size), which is the smallest unit
that Oracle can read from or write to disk. From memory this defaults to
2KB, but can be set to 4KB, 8KB, 16KB or 32KB instead (our teaching system
is set to 8KB). If records are smaller than the block size, you get multiple
records per block, as many as will fit. If records are larger than the block
size, they get chopped into pieces that are stored one piece per block, and
chained together.
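
To make the packing and chaining idea concrete, here's a rough sketch in
Python. This is NOT how Oracle implements it: the function names are made up
for illustration, the 8KB block size is just the teaching-system figure
mentioned above, and it ignores the header overhead that a real block
carries.

```python
# Illustrative only: hypothetical names, no per-block header overhead.
BLOCK_SIZE = 8 * 1024  # assumed 8KB database block


def records_per_block(record_size, block_size=BLOCK_SIZE):
    """How many whole records fit in one block (0 means it must be chained)."""
    return block_size // record_size


def blocks_needed(record_size, block_size=BLOCK_SIZE):
    """How many chained blocks one record occupies (integer ceiling)."""
    return max(1, (record_size + block_size - 1) // block_size)


def split_into_pieces(data, block_size=BLOCK_SIZE):
    """Chop an oversized record into block-sized pieces, in chain order."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


if __name__ == "__main__":
    print(records_per_block(100))    # small records share a block
    print(blocks_needed(20000))      # a big record spans several blocks
    print([len(p) for p in split_into_pieces(b"x" * 20000)])
```

With these assumptions a 100-byte record packs many to a block, while a
20,000-byte record is split into two full pieces plus a partly filled third.
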

The trick then is to tune your block size to be a good fit with the typical
record size. You could base this on expected usage (e.g., large media files
vs. small text files), although that's probably difficult to predict on a
typical desktop system. Debian Linux does something like this during the
install process: you can nominate your typical disk usage scenario, and it
sets the number of inodes to match (so IIRC you can have fewer inodes on a
partition that will be mainly used for large media files, for example).

However, the fixed block size technique falls apart when you're dealing
with really large multi-gigabyte chunks of data, especially if there's an
upper
limit on the block size. In Oracle's case, even with the maximum block size
you could end up with many thousands of blocks just to hold one record. I'm
a little sketchy on this, but my understanding is that Oracle and other
DBMSs store such items (Oracle refers to them as "large objects") in a
separate part of the database that's optimised for large data. So the big
files and the small files are effectively segregated within the database and
accessed in different ways.
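
For a feel of the numbers, here's a quick back-of-envelope calculation in
Python for the 4GB file from the original question. The block sizes are the
ones discussed above; the helper name is my own, and per-block overhead is
ignored, so the real counts would be slightly higher.

```python
KB = 1024
GB = 1024 ** 3


def blocks_for(size_bytes, block_size):
    """Blocks needed to hold size_bytes, rounding up to a whole block."""
    return (size_bytes + block_size - 1) // block_size


if __name__ == "__main__":
    for block in (2 * KB, 8 * KB, 32 * KB):
        print(f"{block // KB}KB blocks: {blocks_for(4 * GB, block)}")
    # prints:
    # 2KB blocks: 2097152
    # 8KB blocks: 524288
    # 32KB blocks: 131072
```

So even at the largest block size, one 4GB record is a chain of over a
hundred thousand blocks.
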

That's probably the best way to handle it: have one database for the file
system but partition it internally into areas optimised for different sizes
of files. You could even make the database partitions correspond to disk
partitions (or even separate drives). These are all standard tuning tricks
that have been around for years in large-scale databases, but have only
recently started to appear at the lower end.
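
As a sketch of what I mean by internal partitioning, something like the
following Python would route each file to a storage area by size. The area
names, size thresholds and block sizes are all invented for illustration;
a real system would pick them by measurement.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical storage areas: (upper size limit in bytes, name, block size).
# A limit of None means "everything larger than the previous areas".
AREAS = [
    (64 * KB, "small", 2 * KB),
    (16 * MB, "medium", 8 * KB),
    (None, "large", 32 * KB),
]


def choose_area(file_size):
    """Return (area name, block size) for the first area the file fits in."""
    for limit, name, block_size in AREAS:
        if limit is None or file_size <= limit:
            return name, block_size
    raise ValueError("no storage area configured for this size")


if __name__ == "__main__":
    for size in (1 * KB, 1 * MB, 4 * GB):
        print(size, choose_area(size))
```

Each area could then live on its own disk partition or drive, as suggested
above.
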
--
Nigel Stanger, Dunedin, NEW ZEALAND.
http://xri.net/=nigel.stanger