[F] TidBITS  / TidBITS  / TidBITS Talk  /

Sorting out years worth of files

Hank Roberts (apparently) - 06:27am Nov 28, 2008 PST
via email

A question about approach -- not "archive", not "backup", but "collect a lifetime of files ...."

I started with CP/M 86 on a CompuPro with two 8" floppy drives.
I've accumulated files for quite a while.

So far I've just bought bigger and bigger drives and tried to keep
everything available all the time on my main computer.

Is there any good way to take a huge backlog of old files and sort out
duplicate branches high up on the directory tree, not just file by file?

Over the years I've several times had to dump everything into a bigger
drive when one failed -- from a variety of different computers. Each time
I've picked up some duplicates.

The big dupes I can find by size and remove; it's the little ones that are
eating up space like crazy.

I can buy a few terabyte disks and put this off for another year or two,
but eventually ..... it'll become a problem.
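
For the "duplicate branches, not just file by file" part of the question, here is a minimal sketch of one possible approach: reduce each directory to a single digest built from the relative paths and checksums of everything beneath it, then report directories that share a digest. The helper names tree_digest and dup_dirs are made up for illustration, and cksum (a weak CRC) is only an assumption chosen for portability; md5 or a stronger hash could be substituted. It rehashes files once per ancestor directory, so it is slow on big trees, and it does not look at resource forks.

```shell
#!/bin/sh
# Sketch: find directories whose entire contents are identical.
# tree_digest and dup_dirs are hypothetical helper names, not from the thread.

# Print one digest summarizing every regular file under directory $1.
# Two trees get the same digest when their relative paths, sizes and
# checksums all match (cksum is a CRC, so collisions are possible).
tree_digest() {
    ( cd "$1" || exit 1
      find . -type f ! -name .DS_Store -exec cksum {} + | LC_ALL=C sort | cksum )
}

# Print "digest<TAB>directory" for each directory under $1 that shares its
# digest with at least one other directory. Note that all empty directories
# share one digest, and subdirectories of duplicate trees are listed too.
dup_dirs() {
    find "$1" -type d | while IFS= read -r dir; do
        printf '%s\t%s\n' "$(tree_digest "$dir")" "$dir"
    done | LC_ALL=C sort | awk -F'\t' '
        $1 == prev { if (!shown) print prevline; print; shown = 1; next }
        { prev = $1; prevline = $0; shown = 0 }'
}
```

Running dup_dirs on the top of the collection and reading the output from the shortest paths down would show the highest duplicated branches first; anything it reports should still be checked by hand before deleting.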



David Weintraub (apparently) - Dec 1, 2008 5:23 am (#12 Total: 16)  

via email  

Posts: 270
Re: Sorting out years worth of files

> Did you actually try this? On a Mac OS X system? 

Yes, I tested it. Well, sort of. I originally had "-xdev" instead of "-x". But right before I sent the message, I took one look at the manpage for find and noticed that -xdev was not mentioned as an option. I quickly looked through the options and saw that the manpage specified "-x". So I tested with -xdev, but not with "-x".

> second when testing with print0 and xargs -0 on my system I could rarely get it
> to output anything at all (however, I wasn't using the xargs safe flag -X)

I had no problems running this command (using "-xdev" instead of "-x"). The -X xargs safe flag really doesn't do much when you use -print0 and "xargs" with the "-0" parameter. In this case, xargs ignores quotation marks, tabs, and other problem characters and only uses NUL as a separator. That is safe because NUL is one of the two bytes (the other is "/") that a Unix file name cannot contain.

> find . -type f \! -name .DS_Store -exec md5 {} > md5_output.txt \;

> does the same thing without needing xargs.  And it works.  And it's
> much faster.

When you use "-exec" in find, you are calling that command for each and every file found. If you find 1000 files, you are executing that command 1000 times.

The "xargs" command fills up the command line buffer with the list of found files, then calls the command. If you find 1000 files, you are only calling the command once, maybe twice, but not 1000 times. 
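
A small sketch that makes the difference visible by counting how often the command is actually started. The helper names (count_exec_each, count_xargs, count_exec_plus) are made up for illustration, and a tiny "sh -c" stands in for md5 so that each start prints exactly one line. The third variant uses the "-exec ... {} +" form, which batches file names much like xargs; I am assuming your find supports it (it is in POSIX and in current GNU and BSD find).

```shell
#!/bin/sh
# Sketch: count how many times the command is started under each approach.
# The count_* function names are hypothetical, not from the thread.

# -exec ... \; starts the command once per file found.
count_exec_each() {
    find "$1" -type f -exec sh -c 'echo started' sh {} \; | wc -l | tr -d ' '
}

# xargs packs as many file names as fit onto each command line.
count_xargs() {
    find "$1" -type f -print0 | xargs -0 sh -c 'echo started' sh | wc -l | tr -d ' '
}

# -exec ... {} + batches file names inside find itself, without xargs.
count_exec_plus() {
    find "$1" -type f -exec sh -c 'echo started' sh {} + | wc -l | tr -d ' '
}
```

On a directory of 75 small files, the first function should report 75 starts and the other two should report 1, since 75 short names fit easily on one command line.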

I did a simple test on a small directory which contained 75 files. I ran the following command:

$ time find . -type f \! -name .DS_Store -exec md5 {} \; 
And got the following time.
 
real 0m1.84s
user 0m0.06s
sys 0m0.27s

I then ran this command:

$ time find . -type f \! -name .DS_Store -print0 | xargs -0 md5
And got the following time:

real 0m0.30s
user 0m0.01s
sys 0m0.03s

It appears that xargs is much, much faster.

--
David Weintraub
qazwartgmail.com

Lewis Butler (apparently) - Dec 1, 2008 11:56 am (#13 Total: 16)  

via email  

Posts: 1136
Re: Sorting out years worth of files

On 1-Dec-2008, at 05:23, David Weintraub wrote:
> I did a simple test on a small directory which contained 75 files. I
> ran the following command:
>
> $ time find . -type f \! -name .DS_Store -exec md5 {} \;
>
> And got the following time.
>
> real 0m1.84s
> user 0m0.06s
> sys 0m0.27s
>
> I then ran this command:
>
> $ time find . -type f \! -name .DS_Store -print0 | xargs -0 md5
>
> And got the following time:
>
> real 0m0.30s
> user 0m0.01s
> sys 0m0.03s

I got exactly opposite results on a directory containing 91 jpgs:

$ time find -x . -type f \! -name .DS_Store -print0 | xargs -0 md5 > md5_output.txt

real 0m10.040s
user 0m0.437s
sys 0m0.494s

[cerebus] ~/Documents/June 2005 $ time find . -type f \! -name .DS_Store -exec md5 {} > md5_output.txt \;

real 0m3.033s
user 0m0.478s
sys 0m0.566s

Note that I was actually using the redirect. xargs took 10 seconds,
-exec took 3.
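
On the redirect itself: the shell removes "> file" from the command line before find ever runs, so a redirect written inside the -exec clause behaves the same as one written at the end of the whole command. A minimal sketch to check that, with made-up function names and with cksum standing in for md5 (an assumption, so it runs where md5 is not installed). The output file is passed as a second argument and should live outside the searched directory, otherwise find will pick up the half-written output file as well.

```shell
#!/bin/sh
# Sketch: the shell handles "> file" before find starts, so the position of
# the redirect among find's arguments does not change the result.
# Function names are hypothetical; cksum stands in for md5.

# Redirect written inside the -exec clause, as in the thread.
sums_redirect_inside() {
    find "$1" -type f ! -name .DS_Store -exec cksum {} > "$2" \;
}

# Redirect written after the whole find command.
sums_redirect_after() {
    find "$1" -type f ! -name .DS_Store -exec cksum {} \; > "$2"
}
```

Both functions take the directory to scan and the output file, and should leave the same lines in that file.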

The only time I saw xargs be faster was when I ran it over a
folder that contained only plain text files. Then it was much faster.
If there were any binary files (or possibly just large files?), the
speed dropped drastically.

When I ran it over my entire Documents directory, I gave up on the
xargs version after 30 minutes. The -exec version finished in just
about 10 minutes (18,000 files).

Nigel Stanger (apparently) - Dec 2, 2008 9:43 am (#14 Total: 16)  

via email - Dunedin, New Zealand  

Posts: 448
Re: Sorting out years worth of files

On 2/12/2008 1:23 AM, "David Weintraub" <qazwartgmail.com> spake thus:

> I tested with -xdev, but not with "-x".

Note that -x doesn't work at all if you're using the Fink-installed (GNU)
version of find. -xdev is fine.

--
Nigel Stanger, Dunedin, NEW ZEALAND.
http://xri.net/=nigel.stanger


James Grinter - Dec 3, 2008 5:54 am (#15 Total: 16)  

Re: Sorting out years worth of files

Of course, you may also want to compare the resource forks of your files when deciding if two files are duplicates.

This is especially important if you've got files accumulated through many earlier system upgrades. Files generated before Mac OS X may frequently have content in the resource fork and not the data fork, and even with recent Mac OS X files there may be important things there that you need to consider.

johnbaxterlists (apparently) - Dec 4, 2008 2:40 am (#16 Total: 16)  

via email  

Posts: 678
Re: Sorting out years worth of files

If a command doesn't seem to match its man page, it could be because
man pages are like that in too many cases.

It could also be the result of a problem I ran into on a machine on
which I did an upgrade install of Leopard over Tiger: the Tiger man
pages did not get replaced although the commands they described did.
(I found the problem as the result of an odd telephone conversation
with my boss, whose man page clearly didn't look like mine for
syslogd.)

I wound up doing a clean install and migrate from bootable clone,
which fixed things.

  --John


