>[But what's the real world downside? Emailer used to do this long ago, but
>the filesystem then couldn't come close to handling it at a decent
>performance level. Is performance still a problem with this approach? -Adam]
I did a full backup last night, over 100 Mbit/s Ethernet to a hard disk on a
fast computer running Retrospect in server mode. I watched casually and saw
the speed vary from 49 MB/min to 465 MB/min. It was obvious from watching
that the high speeds were on large files and the low speeds occurred while
copying large directories of small files. The large files were mostly
non-compressible stuff like .mov, so no compression effects were involved.
The top speed is close to the limits of the LAN, so the actual client
performance difference may have been even greater. OTOH, without doing
better-controlled experiments, I can't tell whether the difference was due
to the file system itself or to the Retrospect client, or whether the
Retrospect server was itself a bottleneck.
At any rate, that's a real-world case with a nearly ten-fold performance
difference. That's still meaningful to me.
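
For anyone who wants to check the arithmetic, here is a small sketch using the two speeds quoted above. The only assumption is the usual 8 bits per byte; the 100 Mbit/s figure is the nominal Ethernet rate, not a measured one.

```python
# Figures observed during the backup described above (megabytes per minute).
slow_mb_per_min = 49
fast_mb_per_min = 465


def mb_per_min_to_mbit_per_s(mb_per_min):
    """Convert megabytes per minute to megabits per second (8 bits per byte)."""
    return mb_per_min * 8 / 60


ratio = fast_mb_per_min / slow_mb_per_min
fast_mbit = mb_per_min_to_mbit_per_s(fast_mb_per_min)
slow_mbit = mb_per_min_to_mbit_per_s(slow_mb_per_min)

print(f"speed ratio: {ratio:.1f}x")  # speed ratio: 9.5x
print(f"fast: {fast_mbit:.1f} Mbit/s, slow: {slow_mbit:.1f} Mbit/s")
# fast: 62.0 Mbit/s, slow: 6.5 Mbit/s
```

So the fast case moved about 62 Mbit/s of payload over a nominal 100 Mbit/s link, and the slow case about a tenth of that.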
At 12:22 PM 08/01/2005 -0700, Tom Robinson wrote:
>The biggest drawback of working at the file level is we're unlikely
>to see Spotlight indexing of FileMaker databases, Entourage e-mail,
>etc. But from a technical perspective, how could Apple have
>programmed Spotlight to jump to a particular record or e-mail?
>
>[I was hoping someone who knew more would jump in, but my impression is
>that Apple is at least working on, if they haven't already shipped, an API
>that developers can use to let Spotlight look inside database files. -Adam]
A file system is just a specialized database. There's no reason a database
-- at least a record-oriented or object-oriented database -- can't be given
an interface that makes it look like a file system, where the records are
files. After all, that's exactly what a mountable disk image is -- a set of
APIs that manipulate references to objects called "files" within some
container. The API is the same whether the container is a Macintosh HD, an
ISO 9660 CD, a network disk, a .dmg file, etc. For the container to be
a FileMaker database or a Mail mailbox is not a big deal compared with the
radical variations among file systems under modern OSs.
In the long view, it doesn't matter whether Spotlight is set up to view
file systems as specialized databases or to view databases as specialized
file systems, as long as it takes a broad enough view to encompass both.
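
To make the "same API, different container" idea concrete, here is a minimal sketch. It is not how Spotlight or any Apple API works; the Container interface, the class names, and the "records" table are all hypothetical, invented for illustration. It shows one search routine running unchanged over a directory of files and over a database table whose rows are presented as files.

```python
import os
import sqlite3
import tempfile
from typing import List, Protocol


class Container(Protocol):
    """Hypothetical minimal 'file system' view: named items holding bytes."""

    def names(self) -> List[str]: ...
    def read(self, name: str) -> bytes: ...


class DirectoryContainer:
    """Items are ordinary files in one directory."""

    def __init__(self, path: str) -> None:
        self.path = path

    def names(self) -> List[str]:
        return sorted(os.listdir(self.path))

    def read(self, name: str) -> bytes:
        with open(os.path.join(self.path, name), "rb") as f:
            return f.read()


class TableContainer:
    """Items are rows in a database table; each record looks like a file."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def names(self) -> List[str]:
        rows = self.conn.execute("SELECT name FROM records ORDER BY name")
        return [row[0] for row in rows.fetchall()]

    def read(self, name: str) -> bytes:
        row = self.conn.execute(
            "SELECT body FROM records WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise FileNotFoundError(name)
        return row[0]


def search(container: Container, needle: bytes) -> List[str]:
    """Return names of items containing needle, whatever the backing store."""
    return [n for n in container.names() if needle in container.read(n)]


messages = {"msg1.txt": b"lunch on Friday?", "msg2.txt": b"backup finished"}

# Same data stored as files in a directory...
with tempfile.TemporaryDirectory() as tmp:
    for name, body in messages.items():
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(body)
    from_dir = search(DirectoryContainer(tmp), b"backup")

# ...and as records in a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT PRIMARY KEY, body BLOB)")
conn.executemany("INSERT INTO records VALUES (?, ?)", list(messages.items()))
from_db = search(TableContainer(conn), b"backup")

print(from_dir, from_db)  # ['msg2.txt'] ['msg2.txt']
```

The search function never learns which kind of container it was handed, which is the whole point.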
>I second Adam's question, that you haven't said what your objection
>to e-mails as individual files is? I love it--as Chris says,
>incremental backups fly. No more backing up 400 Megs because a few
>new messages have arrived.
As mentioned above, I see massive performance degradation backing up very
large numbers of small files.
Any search requiring a relation that Spotlight doesn't have an index for
will suffer similar degradation. Even on a modern computer, opening 100,000
or a million files takes a long time, much longer than reading the data.
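
If you want to try this on your own machine, here is a rough timing sketch, not a proper benchmark. The file count, the 1 KB size, and the file names are arbitrary assumptions. Note also that freshly written files will mostly be served from the OS cache, so this measures per-file open/close overhead rather than disk seeks; a cold-cache run would likely show a larger gap.

```python
import os
import tempfile
import time


def read_all(paths):
    """Open and read every path; return the total number of bytes read."""
    total = 0
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    return total


def compare(count=2000, size=1024):
    """Time reading `count` files of `size` bytes vs. one file of equal total size."""
    payload = b"x" * size
    with tempfile.TemporaryDirectory() as tmp:
        small = []
        for i in range(count):
            path = os.path.join(tmp, f"msg{i:06d}")
            with open(path, "wb") as f:
                f.write(payload)
            small.append(path)
        big = os.path.join(tmp, "mailbox")
        with open(big, "wb") as f:
            f.write(payload * count)

        t0 = time.perf_counter()
        small_bytes = read_all(small)
        t1 = time.perf_counter()
        big_bytes = read_all([big])
        t2 = time.perf_counter()
    return small_bytes, big_bytes, t1 - t0, t2 - t1


small_bytes, big_bytes, small_secs, big_secs = compare()
print(f"{small_bytes} bytes in many small files: {small_secs:.4f} s")
print(f"{big_bytes} bytes in one large file:    {big_secs:.4f} s")
```

Both runs read exactly the same number of bytes; only the number of opens differs.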
Backups can be argued either way. Yes, with totally naive storage and
organization and a completely file-based backup (no database audit
techniques), small files mean less duplicated data. OTOH, my daily backups
only duplicate a very small part of my email corpus because it's organized
into monthly mailboxes, most of which don't change. The interesting thing
is that I established this organization purely to make it easier for my
mind to handle it, not for the computer -- and yet it seems to serve the
computer well. However, if you prefer to regard the email corpus as one
bundle and view it entirely via Spotlight -- an eminently reasonable
approach -- then you do have different issues with backups.
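
Here is a back-of-envelope sketch of the trade-off. It assumes a purely file-level incremental backup that recopies any changed file in full, and it ignores deletions, index files, and catalog overhead. The message counts and sizes are made-up illustration values, not measurements of my mail.

```python
def incremental_bytes(full_months, msgs_per_month, msg_size,
                      current_month_msgs, new_msgs):
    """Bytes recopied by a file-level incremental backup after new mail arrives.

    current_month_msgs counts the messages in the current month's mailbox
    after the new ones have arrived, so new_msgs cannot exceed it.
    """
    if new_msgs > current_month_msgs:
        raise ValueError("new_msgs cannot exceed current_month_msgs")
    total_msgs = full_months * msgs_per_month + current_month_msgs
    return {
        # One big mailbox file changes, so the whole corpus is recopied.
        "single mailbox": total_msgs * msg_size,
        # Only the current month's mailbox file changes.
        "monthly mailboxes": current_month_msgs * msg_size,
        # Only the new message files are copied.
        "file per message": new_msgs * msg_size,
    }


# Hypothetical corpus: 5 years of mail, 1000 messages a month, 4000 bytes each.
result = incremental_bytes(full_months=60, msgs_per_month=1000, msg_size=4000,
                           current_month_msgs=500, new_msgs=30)
for layout, nbytes in result.items():
    print(f"{layout:>18}: {nbytes:>12,} bytes")
```

With these numbers the monthly layout recopies under 1% of what the single mailbox does, though file-per-message still copies less.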
As an aside, my objections to storing email in databases are based on the
fact that such databases don't use proper database backup techniques such
as audit files and thus can't be backed up reasonably, and that the
structure isn't open in a way that allows arbitrary programs to read it
without licensing software. Fix those issues and I'd pick a mail program
that uses a database. At present there's nothing on the horizon that meets
either of these criteria, much less both.
My most immediate objection to a huge number of files, though, is mostly
visceral. I've been programming almost 40 years now, and one thing that's
been pretty constant is that stressing the file system too hard is a Bad
Thing. You can argue and argue that file systems have become much more
robust (I agree) and that in theory they can handle the stress. History
makes me leery of trusting these arguments.
The one other point I'd make is in balancing indexing data vs file data.
When the average file size is only a kilobyte or so, you're reaching the
point that the directories consume nearly as much space as the files --
maybe not within a few percent, but within a small factor. This doesn't
seem to be a good use of the file system. A reasonable ratio of data to
index is a simple and imperfect, but useful, criterion in picking the
storage method.
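
A worked example of that criterion follows. The 256 bytes of directory data per file is an assumption picked for illustration, not a measured figure for HFS+ or any other file system, and the mailbox size is likewise hypothetical.

```python
def data_to_index_ratio(avg_file_bytes, per_file_index_bytes):
    """Ratio of file data to directory (index) data, per file."""
    return avg_file_bytes / per_file_index_bytes


# ASSUMPTION: 256 bytes of directory data per file (illustrative only).
ASSUMED_INDEX_BYTES = 256

per_message = data_to_index_ratio(1024, ASSUMED_INDEX_BYTES)         # ~1 KB message
per_mailbox = data_to_index_ratio(2_000_000, ASSUMED_INDEX_BYTES)    # ~2 MB mailbox

print(f"file per message: {per_message:.1f} bytes of data per index byte")
print(f"monthly mailbox:  {per_mailbox:.1f} bytes of data per index byte")
```

Under that assumption a 1 KB message carries only about 4 bytes of data per byte of directory, against several thousand for a mailbox file.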
At 12:12 PM 08/01/2005 -0400, Chris Pepper wrote:
>Backups fly -- you don't have to pick up old messages in incrementals.
See above. Incremental backups fly, full backups don't.
>Also, messages are only written once (and possibly deleted once), so
>message corruption is almost a non-issue.
At the cost of a lot more data in, and writing to, the file system
directory. Personally I'd rather do more file I/O and less directory I/O.
I'd rather have a corrupted message than a corrupted disk directory.
>In contrast, I have lots of Eudora mailboxes containing corrupt messages.
>Dealing with this is painful, and with message-per-file, would not be.
I don't know why the difference, but I've never had problems with corrupted
Eudora mailboxes except when I had a hard disk in the process of going
south, and I had problems with a lot of things then. ;-) This is in six
years of using Eudora, and before that eight years using uAccess, which
used the same mailbox format.
>Basically, this puts a lot more strain on the file system (which seems
>quite able to handle it), which is why message-per-file has been avoided
>on Mac OS X, but that doesn't seem to be a real problem today.
Again, see above. My lack of trust is based on history.
=================================================
I'm not going to argue this at length, and I won't post again unless
there's a specific point to be addressed. File systems are intended to be
used, so there's no perfect argument for using more or fewer files. It's
all in the balance.
Also, I'm not defending mbox format -- it's a terrible database format,
yet that's basically what it's used as. It's useful because it's completely
open -- no special software needed to read it -- and has a good data/index
ratio.
Edward
Art Works by Melynda Reid:
http://paleo.org