Brain Drippings

Finding Duplicate Files

by Kreme on Nov.28, 2008, under Computer

So, the question came up of how to search a lot of files for duplicates, and I dashed off a quick, albeit correct, answer.  But then, as these things often do, it niggled and wormed around and I decided on a better method.  It all starts with a short little find command:

find . -type f -exec md5 {} >>mymd5s.txt \;

find       = the find command
.          = the current directory (and any directories 'below')
-type f    = only match 'f' files (actual files, not directories)
-exec      = then execute the following command for each file found
md5        = calculate an md5 hash
{}         = of the file we found with find
>>         = append the output, without overwriting, to
mymd5s.txt = the name of the file
\;         = marks the end of the -exec command

(The >>mymd5s.txt part is handled by the shell rather than by find, so it collects the output of the whole run.)

mymd5s.txt will contain lines like this:

MD5 (./html templates/item.html) = 8d7d85fdce77e3cab050cfebdf04ad8f
MD5 (./html templates/news.rss) = 62eb138624a4e6781c0f4e92d4628fea
MD5 (./html templates/newsitem.rss) = aa98d24128b7f321ece9111efb0856a6
MD5 (./html templates/normal.html) = 03bf0d92d0a58c0297f147828640ecb3
MD5 (./html templates/sidebar.html) = ea164131610c6081e563f9a47a4ac2d4

The part in parentheses is the relative path to the file, and the md5 hash follows the =.  I can then use BBEdit to find the duplicates.

And, to make it easier to see the duplicates, I can also sort the lines by md5 hash, so that identical hashes end up on adjacent lines.
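
For anyone without BBEdit, the same sort-and-compare step can be roughed out in the shell. This is only a sketch, not what I actually ran: the function name show_duplicates is made up for illustration, and it assumes every line has the "MD5 (path) = hash" shape shown above, with the hash as the last field.

```shell
# Print only the lines of an md5 listing whose hash occurs more than once,
# sorted so that lines with the same hash sit next to each other.
# Expects lines like: MD5 (./path/to/file) = <hash>
show_duplicates() {
    # The file is read twice: the first pass counts each hash (the last
    # whitespace-separated field), the second prints lines whose hash
    # was counted at least twice.
    awk 'NR == FNR { count[$NF]++; next } count[$NF] > 1' "$1" "$1" |
        # Prefix each line with its hash, sort on that, then drop the prefix.
        awk '{ print $NF "\t" $0 }' | sort | cut -f2-
}

# Example: show_duplicates mymd5s.txt
```

Files that appear only once drop out of the output entirely, which keeps the list short when most of the tree is unique.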

For example, I find a lot of duplicate .DS_Store files:

MD5 (./Genealogy/Royal Family/.DS_Store) = 194577a7e20bdcc7afbb718f502c134c
MD5 (./iChats/.DS_Store) = 194577a7e20bdcc7afbb718f502c134c
MD5 (./iChats/2008-04-05/.DS_Store) = 194577a7e20bdcc7afbb718f502c134c
MD5 (./iChats/2008-04-26/.DS_Store) = 194577a7e20bdcc7afbb718f502c134c
MD5 (./iChats/2008-04-27/.DS_Store) = 194577a7e20bdcc7afbb718f502c134c
MD5 (./iChats/2008-05-22/.DS_Store) = 194577a7e20bdcc7afbb718f502c134c
MD5 (./June 2005/.DS_Store) = 194577a7e20bdcc7afbb718f502c134c

But here is an actual duplicate file that exists in my Documents:

MD5 (./Personal/Writing/Trumpet (word)) = 1983845ed62444109dbeb378ba19bf86
MD5 (./Writing/Trumpet (word).doc) = 1983845ed62444109dbeb378ba19bf86

As you can see, I have probably duplicated a directory (Writing) under “Personal”.

If I didn’t want to use BBEdit, or if I were dealing with many thousands of duplicates, I could import that file into a spreadsheet (replace ^(.*) = (.*)$ with "\1","\2" and you have a csv file), or even into a database where it would be trivial to find all the duplicate hashes. While it is technically possible for two different files to generate an identical md5 hash, the hash is 128 bits long, so the odds of that happening by accident are far smaller than billions to one. The advantage to md5 is that it is quite fast to calculate, and the command is already on your Mac.
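
The csv conversion can also be done without an editor. Here is a minimal sketch using sed with the same pattern as the find-and-replace above; md5_to_csv is a name I made up, and it assumes no file name contains a double quote (those would need escaping before a spreadsheet would accept them).

```shell
# Convert an md5 listing to two-column CSV. Everything before the last
# " = " on a line becomes column 1, everything after it becomes column 2.
md5_to_csv() {
    # -E enables extended regular expressions, so ( ) group without backslashes.
    sed -E 's/^(.*) = (.*)$/"\1","\2"/' "$1"
}

# Example: md5_to_csv mymd5s.txt > mymd5s.csv
```

Because the first group is greedy, a path that itself contains " = " still ends up whole in the first column.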

Also, since I have the path to the file, it is trivial to automatically delete (or perhaps move to a different directory before deleting) all those extra duplicate files.
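
As a rough sketch of that clean-up step, the function below lists every copy after the first one seen for each hash. The name list_extra_copies and the ".duplicate" suffix in the commented dry run are hypothetical, and the code assumes the "MD5 (path) = hash" line format with no newlines inside file names.

```shell
# For every hash that appears more than once, print the path of each copy
# after the first one seen. Expects lines like: MD5 (./path) = <hash>
list_extra_copies() {
    awk '{
        hash = $NF
        # Strip the leading "MD5 (" (5 characters) and the trailing
        # ") = <hash>" (4 characters plus the hash) to leave the path.
        path = substr($0, 6, length($0) - 9 - length(hash))
        if (seen[hash]++) print path
    }' "$1"
}

# Dry run: show what would be renamed. Review the output before
# removing the echo.
# list_extra_copies mymd5s.txt | while IFS= read -r path; do
#     echo mv -- "$path" "$path.duplicate"
# done
```

Which copy counts as "first" depends only on the order of lines in the listing, so it is worth looking the list over before acting on it. Empty files all share one hash, and identical .DS_Store files will show up too, so those probably want filtering out first.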

The downside to this is the amount of time it will take to create the md5 hashes of all the files.  I ran it over 18,458 files in my Documents folder on a quad-core 2.0 GHz Mac Pro and it took nearly 11 minutes.

 [~/Documents] $ time find . -type f -exec md5 {} >>mymd5s.txt \;

real	10m52.277s
user	0m54.911s
sys	1m17.949s
 [~/Documents] $  wc -l mymd5s.txt
   18458 mymd5s.txt
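
One variation that might shave some of that time is ending -exec with + instead of \;, which hands many files to each md5 invocation rather than starting one process per file. I have not timed it, so treat this as a sketch: user plus sys above add up to only a little over two minutes, which suggests most of the wait is the disk, and batching would not help with that. The function name hash_tree and the HASH_CMD override are made up for illustration.

```shell
# Hash every regular file under a directory, passing many files to each
# invocation of the hash command ("+") instead of starting one process
# per file ("\;"). Output goes to stdout, so the listing can be saved
# outside the tree being scanned.
# HASH_CMD is a hypothetical override for systems that have no md5 command.
hash_tree() {
    find "$1" -type f -exec "${HASH_CMD:-md5}" {} +
}

# Example: hash_tree ~/Documents > /tmp/mymd5s.txt
```

Saving the listing somewhere outside the scanned directory also keeps mymd5s.txt itself from being hashed while it is still being written.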