TidBITS Talk
Sorting out years worth of files
Hank Roberts (apparently) - 06:27am Nov 28, 2008 PST via email
A question -- not "archive" not "backup" but "collect a lifetime of files ...." -- approach?
I started with CP/M 86 on a CompuPro with two 8" floppy drives.
I've accumulated files for quite a while.
So far I've just bought bigger and bigger drives and tried to keep
everything available all the time on my main computer.
Is there any good way to take a huge backlog of old files and sort out
duplicate branches high up on the directory tree, not just file by file?
Over the years I've several times had to dump everything into a bigger
drive when one failed -- from a variety of different computers. Each time
I've picked up some duplicates.
The big dupes I can find by size and remove; it's the little ones that are
eating up space like crazy.
I can buy a few terabyte disks and put this off for another year or two,
but eventually ..... it'll become a problem.
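For the "duplicate branches" part of the question, one possible approach is to fingerprint each directory from the checksums of everything beneath it and then look for directories that share a fingerprint. The following is only a rough sketch under stated assumptions: `dir_digest` and `find_dup_dirs` are hypothetical helper names, `cksum` (a CRC, weaker than MD5) is used only because it exists on every POSIX system, and directory names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch only: dir_digest and find_dup_dirs are hypothetical helper names.

# Print a fingerprint for directory $1, built from the checksum, size and
# relative path of every regular file beneath it.
dir_digest() {
    ( cd "$1" &&
      find . -type f ! -name .DS_Store -exec cksum {} + |
      LC_ALL=C sort ) | cksum | awk '{ print $1 "-" $2 }'
}

# List groups of directories under $1 that share a fingerprint.
find_dup_dirs() {
    find "$1" -type d | while IFS= read -r dir; do
        digest=$(dir_digest "$dir")
        case $digest in
            *-0) continue ;;    # empty listing: no files inside, skip it
        esac
        printf '%s\t%s\n' "$digest" "$dir"
    done | LC_ALL=C sort | awk -F '\t' '
        { count[$1]++; dirs[$1] = dirs[$1] "\n    " $2 }
        END {
            for (d in count)
                if (count[d] > 1)
                    print "identical branches:" dirs[d]
        }'
}
```

Calling `find_dup_dirs ~/Documents` would print each group of matching directories. Two caveats: if two branches match, so does every matching pair of subdirectories inside them, so the highest-level pair is the interesting one; and because every file is re-read once per ancestor directory, this is slow on deep trees. Confirming a match with `diff -r` before deleting anything seems prudent.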

David Weintraub (apparently) - Nov 30, 2008 7:05 am (#9 Total: 16)
Re: Sorting out years worth of files
Alright. Hold your breath:
$ sudo find . -x -type f \! -name .DS_Store -print0 | xargs -0 md5 > md5_output.txt
A little explanation of this command:
The "find" command finds files on your hard drive.
The "-x" prevents your find command from searching other disks that might be connected to your Mac or being shared on your Mac.
The "-type f" parameter limits searches just to files.
The "\! -name .DS_Store" tells the find command not to find files with the name ".DS_Store"
The "-print0" works with the xargs command. It basically says instead of printing a file name on each line (which is what the find command normally does), print out a file name followed by a NULL character. You need to do this if you expect that your find command will return file names with spaces in them.
The "|" pipe symbol takes the output of the command before the pipe and uses it as input to the command after the pipe. So, basically the output from the find command will be used as input to the xargs command.
The "xargs" command is rather interesting because it takes the output of the find command and feeds it, as arguments, to the command named after xargs. In this case, that is the "md5" command, which returns the MD5 signature of each file. The "-0" parameter of the xargs command says that the file names will be separated by the NULL character instead of white space. That's why we used the -print0 option on the find command. Otherwise, xargs would assume that file names are separated by newlines or spaces, and a name containing a space would be split into several arguments.
The ">" redirects the output to be placed in a file called "md5_output.txt".
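To see why the -print0 and -0 pair matters, here is a small throwaway demonstration. It is a sketch, not part of the original command: the file names are made up, and `cksum` stands in for `md5` so that it also runs on systems without the Mac OS X md5 command.

```shell
#!/bin/sh
# Demo: file names with spaces, with and without -print0 / xargs -0.
demo=$(mktemp -d)
printf 'one\n' > "$demo/plain.txt"
printf 'two\n' > "$demo/name with spaces.txt"

# Without -print0, xargs splits "name with spaces.txt" into three
# arguments, so cksum is handed names of files that do not exist.
unsafe=$(find "$demo" -type f | xargs cksum 2>/dev/null | wc -l)

# With -print0 and -0, each name arrives intact.
safe=$(find "$demo" -type f -print0 | xargs -0 cksum | wc -l)

echo "without -print0: $unsafe file(s) checksummed"
echo "with -print0:    $safe file(s) checksummed"

rm -rf "$demo"
```

Only the second form should report both files.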
In the end, you will have all of your files listed with their MD5 hash in the md5_output.txt file. So far, so good. What you really want is to find the duplicate md5 hashes. So, we must strip out the file names and sort by the md5 hash:
$ sed 's/^.*) = //' md5_output.txt | sort | uniq -c | grep -v "^ *1 " > sorted_md5_output.txt
This will create a file called "sorted_md5_output.txt" with a bunch of lines that look like this:
4 6eb14b7698f73179227e743e3e55cb26
2 786c44a435aa167c2df93cb71cf15b00
2 821bfd9fadf75da2937f43c96b2fe217
5 b909002135c33f47f9be0d49e268e8f4
The first number is the number of files you have with that md5 hash, and the second is the md5 hash itself. What you don't have are the file names.
Fortunately, you can search the first file (md5_output.txt) for the md5 hash of each duplicate, and delete all but one of the copies.
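That last lookup can be done in one pass. The following is a minimal sketch, assuming md5_output.txt holds lines in the Mac OS X md5 format, "MD5 (path) = hash"; `list_dupes` is a hypothetical helper name, and output in another format (GNU md5sum, for instance) would need different parsing.

```shell
#!/bin/sh
# list_dupes (hypothetical helper): group file names by duplicated hash.
# Expects lines of the form:  MD5 (path) = hash
list_dupes() {
    awk '{
        hash = $NF                          # the hash is the last field
        path = $0
        sub(/^MD5 \(/, "", path)            # strip the leading "MD5 ("
        sub(/\) = [0-9a-f]+$/, "", path)    # strip the trailing ") = hash"
        count[hash]++
        files[hash] = files[hash] "\n    " path
    }
    END {
        for (hash in count)
            if (count[hash] > 1)
                print hash " (" count[hash] " copies):" files[hash]
    }' "$1"
}
```

Running `list_dupes md5_output.txt` prints each duplicated hash followed by the files that share it. It only reports; which copy to keep is still your call.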

Lewis Butler (apparently) - Nov 30, 2008 4:29 pm (#10 Total: 16)
Re: Sorting out years worth of files
On 30-Nov-2008, at 07:05, David Weintraub wrote:
> $ sudo find . -x -type f \! -name .DS_Store -print0 | xargs -0
> md5 > md5_output.txt
Did you actually try this? On a Mac OS X system? First, the -x has
to come before the path declaration (so, "find -x ."); second, when
testing with -print0 and xargs -0 on my system I could rarely get it to
output anything at all (however, I wasn't using the xargs safe flag,
-X); third, sudo is overkill and not needed. We've established these
are document files owned by the user.
find . -type f \! -name .DS_Store -exec md5 {} > md5_output.txt \;
does the same thing without needing xargs. And it works. And it's
much faster.

Treadway1 (apparently) - Nov 30, 2008 4:29 pm (#11 Total: 16)
Re: Sorting out years worth of files
> find . -x -type f \! -name .DS_Store -print0 | xargs -0 md5 >
> md5_output.txt
Seems to be a bug in find or the man page. I get:
find: -x: unknown option
However:
find -x . -type f \! -name .DS_Store -print0 | xargs -0 md5 > md5_output.txt
and
find . -xdev -type f \! -name .DS_Store -print0 | xargs -0 md5 > md5_output.txt
seem to work OK.
trt

David Weintraub (apparently) - Dec 1, 2008 5:23 am (#12 Total: 16)
Re: Sorting out years worth of files
> Did you actually try this? On a Mac OS X system?
Yes, I tested it. Well, sort of. I originally had "-xdev" instead of "-x". But right before I sent the message, I took one look at the manpage for find and noticed that -xdev was not mentioned as an option. I quickly looked at the options and noticed that the manpage specified "-x". I tested with -xdev, but not with "-x".
> second when testing with print0 and xargs -0 on my system I could rarely get it
> to output anything at all (however, I wasn't using the xargs safe flag -X)
I had no problems running this command (using "-xdev" instead of "-x"). The -X xargs safe flag really doesn't do much when you use -print0 and "xargs" with the "-0" parameter. In this case, xargs ignores quotation marks, tabs, and other problem characters and only uses NULL as a separator. That works because NULL is one of the two characters (the other is "/") that a Unix file name cannot contain.
> find . -type f \! -name .DS_Store -exec md5 {} > md5_output.txt \;
>
> does the same thing without needing xargs. And it works. And it's
> much faster.
When you use "-exec" in find, you are calling that command for each and every file found. If you find 1000 files, you are executing that command 1000 times.
The "xargs" command fills up the command line buffer with the list of found files, then calls the command. If you find 1000 files, you are only calling the command once, maybe twice, but not 1000 times.
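The difference in the number of invocations can be counted directly. This is a sketch with made-up file names: each run of the inner `sh` prints one line, so counting lines counts invocations. It also shows the `-exec ... {} +` form of find, which batches file names much as xargs does. It measures how often the command is started, not how long the whole job takes.

```shell
#!/bin/sh
# Demo: how many times does the command run for 20 files?
demo=$(mktemp -d)
i=1
while [ "$i" -le 20 ]; do
    printf '%s\n' "$i" > "$demo/file $i.txt"
    i=$((i + 1))
done

# -exec ... \; starts the command once per file found.
per_file=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | wc -l)

# xargs packs as many names as fit onto each command line.
batched=$(find "$demo" -type f -print0 | xargs -0 sh -c 'echo run' sh | wc -l)

# -exec ... {} + batches inside find itself, with no pipe.
plus=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | wc -l)

echo "-exec {} \\; : $per_file invocation(s)"
echo "xargs -0    : $batched invocation(s)"
echo "-exec {} +  : $plus invocation(s)"

rm -rf "$demo"
```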
I did a simple test on a small directory which contained 75 files. I ran the following command:
$ time find . -type f \! -name .DS_Store -exec md5 {} \;
And got the following time.
real 0m1.84s
user 0m0.06s
sys 0m0.27s
I then ran this command:
$ time find . -type f \! -name .DS_Store -print0 | xargs -0 md5
And got the following time:
real 0m0.30s
user 0m0.01s
sys 0m0.03s
It appears that xargs is much, much faster.
-- David Weintraub qazwart gmail.com

Lewis Butler (apparently) - Dec 1, 2008 11:56 am (#13 Total: 16)
Re: Sorting out years worth of files
On 1-Dec-2008, at 05:23, David Weintraub wrote:
> I did a simple test on a small directory which contained 75 files. I
> ran the following command:
>
> $ time find . -type f \! -name .DS_Store -exec md5 {} \;
>
> And got the following time.
>
> real 0m1.84s
> user 0m0.06s
> sys 0m0.27s
>
> I then ran this command:
>
> $ time find . -type f \! -name .DS_Store -print0 | xargs -0 md5
>
> And got the following time:
>
> real 0m0.30s
> user 0m0.01s
> sys 0m0.03s
I got exactly opposite results on a directory containing 91 jpgs:
$ time find -x . -type f \! -name .DS_Store -print0 | xargs -0 md5 > md5_output.txt
real 0m10.040s
user 0m0.437s
sys 0m0.494s
[cerebus] ~/Documents/June 2005 $ time find . -type f \! -name .DS_Store -exec md5 {} > md5_output.txt \;
real 0m3.033s
user 0m0.478s
sys 0m0.566s
Note that I was actually using the redirect. xargs took 10 seconds,
-exec took 3.
The only time I saw xargs be faster was when I ran it over a
folder that contained only plain text files. Then it was much faster.
If there were any binary files (or possibly just large files?), the
speed dropped drastically.
When I ran it over my entire Documents directory, I gave up on the
xargs version after 30 minutes. The -exec version finished in just
about 10 minutes (18,000 files).

Nigel Stanger (apparently) - Dec 2, 2008 9:43 am via email - Dunedin, New Zealand (#14 Total: 16)
Re: Sorting out years worth of files
On 2/12/2008 1:23 AM, "David Weintraub" <qazwart  gmail.com> spake thus:
> I tested with -xdev, but not with "-x".
Note that -x doesn't work at all if you're using the Fink-installed (GNU)
version of find. -xdev is fine.
--
Nigel Stanger, Dunedin, NEW ZEALAND.
http://xri.net/=nigel.stanger

James Grinter - Dec 3, 2008 5:54 am (#15 Total: 16)
Re: Sorting out years worth of files
Of course, you may want to also compare the resource-forks of your files, when deciding if two files are duplicates.
This is especially important if you've got files accumulated through many earlier system upgrades. Files generated before Mac OS X may frequently have content in the resource fork and not the data fork, and even with recent Mac OS X files there may be important things there that you need to consider.

johnbaxterlists (apparently) - Dec 4, 2008 2:40 am (#16 Total: 16)
Re: Sorting out years worth of files
If a command doesn't seem to match its man page, it could be because
man pages are like that in too many cases.
It could also be the result of a problem I ran into on a machine on
which I did an upgrade install of Leopard over Tiger: the Tiger man
pages did not get replaced although the commands they described did.
(I found the problem as the result of an odd telephone conversation
with my boss, whose man page clearly didn't look like mine for
syslogd.)
I wound up doing a clean install and migrate from bootable clone,
which fixed things.
--John