This is how to recursively search all directories on a Mac or Linux machine for duplicate images with a single line of awkward bash script. This method will find duplicates anywhere on your disk below your current directory (try "~"), and it will catch images that exist in more than two copies, too.
Many commercial products exist to easily find and delete duplicate images, like those reviewed here, but if you are like me, don't like to download apps willy-nilly for a single task, and have a bit of shell scripting experience, you can use the following line of tortured bash script to find duplicate files.
find . -type f \( -name "*.jpg" -o -name "*.gif" \) | awk '{print "\"" $0 "\""}' | xargs shasum -a 256 | sort > checksumAndFilename.tmp && cat checksumAndFilename.tmp | awk '{print $1}' | uniq -D | uniq > checksum.tmp && grep -f checksum.tmp checksumAndFilename.tmp | tee duplicates.tmp && echo "output in \"duplicates.tmp\"" && rm checksumAndFilename.tmp checksum.tmp
The basic idea is to search the current directory and all subdirectories for image files, calculate a hash for each file, then sort by hash so that identical files end up adjacent and can be listed as duplicates. (If this is too much for your brain, just go to imymac and buy an app.)
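If the one-liner makes your eyes cross, here is the exact same pipeline with backslash line continuations and a comment added; nothing is changed, it is only split across lines:
# The same pipeline, split onto separate lines for readability
find . -type f \( -name "*.jpg" -o -name "*.gif" \) \
  | awk '{print "\"" $0 "\""}' \
  | xargs shasum -a 256 \
  | sort > checksumAndFilename.tmp \
  && cat checksumAndFilename.tmp | awk '{print $1}' | uniq -D | uniq > checksum.tmp \
  && grep -f checksum.tmp checksumAndFilename.tmp | tee duplicates.tmp \
  && echo "output in \"duplicates.tmp\"" \
  && rm checksumAndFilename.tmp checksum.tmp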
Ok, let's go through the command in detail.
- Get all the image files in your directory and below. Update "*.jpg" to "*.png" or whatever you need.
find . -type f \( -name "*.jpg" -o -name "*.gif" \)
- Surround each file name with double quotes, since some people still insist on the horrible, dastardly, awful practice of including spaces in names. (A more robust alternative using find's -print0 is sketched below the sample output.)
awk '{print "\"" $0 "\""}'
- Pipe the names of the files into shasum to generate a hash
xargs shasum -a 256
- Sort by the hash value so duplicates will be adjacent and write to a temp file
sort > checksumAndFilename.tmp
checksumAndFilename.tmp looks like this. The files with the same hash value would be duplicates.
ff45b77226369d27b67772e72dfe8dc3387eff06 ./2010-07-04-2224-July4_036.jpg
ff65e3611973092e61127439af6b3c82d0ee055a ./2010-12-29-1408-IMG_9638.jpg
ff680170b0451868a1bda027c801b78f55067366 ./2010-12-24-1010-IMG_9235.jpg
ff918f6f8230deb3cd2208602dadb5c6f88039dc ./2010-03-14-2025-IPhone_8146.jpg
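As an aside, if your file names contain spaces or other shell-hostile characters, a more robust alternative to the awk quoting trick is to have find and xargs pass NUL-separated names. This is only a sketch of the first half of the pipeline; I have not wired it into the full one-liner:
# -print0 and -0 pass NUL-terminated names, so spaces and quotes in file names are harmless
# (swap -name for -iname if you also want to match .JPG, .Jpg, etc.)
find . -type f \( -name "*.jpg" -o -name "*.gif" \) -print0 \
  | xargs -0 shasum -a 256 \
  | sort > checksumAndFilename.tmp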
We are almost done, but how do we see only the hash values that are duplicated?
- Get only the hash values that are duplicates
cat checksumAndFilename.tmp | awk '{print $1}' | uniq -D | uniq > checksum.tmp
- We have only the duplicated hash values. Let's match the hashes back with their filenames
grep -f checksum.tmp checksumAndFilename.tmp
- Write to the output file and the screen
tee duplicates.tmp
- Let's remind ourselves where the output lives
echo "output in \"duplicates.tmp\""
- Clean up our mess
rm checksumAndFilename.tmp checksum.tmp
checksum.tmp looks like this. These are only the hash values that are duplicated.
0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2
075a137c8857c8b38555cf632d906ed0581b9224
And duplicates.tmp, the final output, looks like this. The first two files are duplicates of each other, and the next two are as well.
0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2 ./2010-11-28-0926-IMG_0300.jpg
0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2 ./IMG_0300.jpg
075a137c8857c8b38555cf632d906ed0581b9224 ./2010-06-08-photoshoot012.jpg
075a137c8857c8b38555cf632d906ed0581b9224 ./2010-06-08-photoshoot_012.jpg
My gut tells me there are some ways to clean this script up. Please add a comment if you can improve the script.
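For what it's worth, here are two small tweaks that should behave the same, though I have not re-run the whole pipeline to verify: uniq -d already prints one line per duplicated value, so it can replace the uniq -D | uniq pair, and grep -F treats the hashes as plain strings instead of regular expressions, which is safer and a little faster.
# uniq -d prints each duplicated hash once, replacing "uniq -D | uniq"
awk '{print $1}' checksumAndFilename.tmp | uniq -d > checksum.tmp
# -F matches the hashes as literal strings rather than regular expressions
grep -F -f checksum.tmp checksumAndFilename.tmp | tee duplicates.tmp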
1 comment:
To cut down on the size of the final hash you can add "| cut -c 60- " right before writing to the duplicates file.
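Along the same lines, if you want to drop the hashes entirely and keep only the file names, shasum prints the hash followed by two separator characters, so with SHA-256 (64 hex characters) the names should start at column 67; double-check the offset for your hash length:
# SHA-256 hashes are 64 characters plus 2 separator characters, so file names start at column 67
grep -f checksum.tmp checksumAndFilename.tmp | cut -c 67- | tee duplicates.tmp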