This is how to recursively search all directories on a Mac or Linux machine for duplicate images with a single line of awkward bash script. This method will find duplicates anywhere on your disk below your current directory (try "~"), and it will catch images that exist in more than two copies, too.
Many commercial products exist to easily find and delete duplicate images, like those reviewed here, but if you are like me, don't like to download apps willy-nilly for a single task, and have a bit of shell scripting experience, you can use the following line of tortured bash script to find duplicate files.
find . -type f \( -name "*.jpg" -o -name "*.gif" \) | awk '{print "\"" $0 "\""}' | xargs shasum -a 256 | sort > checksumAndFilename.tmp && cat checksumAndFilename.tmp | awk '{print $1}' | uniq -D | uniq > checksum.tmp && grep -f checksum.tmp checksumAndFilename.tmp | tee duplicates.tmp && echo "output in \"duplicates.tmp\"" && rm checksumAndFilename.tmp checksum.tmp
The basic idea is to search the current directory and all subdirectories for image files, calculate a hash for each file, then sort by hash so that identical files end up adjacent and can be listed as duplicates. (If this is too much for your brain, just go to imymac and buy an app.)
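If the one-liner makes your eyes cross, here is the exact same pipeline with backslash line continuations and a comment added; nothing is changed, it is only split across lines:
# The same pipeline, split onto separate lines for readability
find . -type f \( -name "*.jpg" -o -name "*.gif" \) \
  | awk '{print "\"" $0 "\""}' \
  | xargs shasum -a 256 \
  | sort > checksumAndFilename.tmp \
  && cat checksumAndFilename.tmp | awk '{print $1}' | uniq -D | uniq > checksum.tmp \
  && grep -f checksum.tmp checksumAndFilename.tmp | tee duplicates.tmp \
  && echo "output in \"duplicates.tmp\"" \
  && rm checksumAndFilename.tmp checksum.tmp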
Ok, let's go through the command in detail.
- Get all the image files in your directory and below. Update "*.jpg" to "*.png" or whatever you need.
find . -type f \( -name "*.jpg" -o -name "*.gif" \)
- Surround each file name with double quotes, since some people still insist on the horrible, dastardly, awful practice of including spaces in names. (A more robust alternative using find's -print0 is sketched below the sample output.)
awk '{print "\"" $0 "\""}'
- Pipe the names of the files into shasum to generate a hash
xargs shasum -a 256
- Sort by the hash value so duplicates will be adjacent and write to a temp file
sort > checksumAndFilename.tmp
checksumAndFilename.tmp looks like this. The files with the same hash value would be duplicates.
ff45b77226369d27b67772e72dfe8dc3387eff06 ./2010-07-04-2224-July4_036.jpg
ff65e3611973092e61127439af6b3c82d0ee055a ./2010-12-29-1408-IMG_9638.jpg
ff680170b0451868a1bda027c801b78f55067366 ./2010-12-24-1010-IMG_9235.jpg
ff918f6f8230deb3cd2208602dadb5c6f88039dc ./2010-03-14-2025-IPhone_8146.jpg
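As an aside, if your file names contain spaces or other shell-hostile characters, a more robust alternative to the awk quoting trick is to have find and xargs pass NUL-separated names. This is only a sketch of the first half of the pipeline; I have not wired it into the full one-liner:
# -print0 and -0 pass NUL-terminated names, so spaces and quotes in file names are harmless
# (swap -name for -iname if you also want to match .JPG, .Jpg, etc.)
find . -type f \( -name "*.jpg" -o -name "*.gif" \) -print0 \
  | xargs -0 shasum -a 256 \
  | sort > checksumAndFilename.tmp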
We are almost done, but how do we see only the hash values that are duplicated?
- Get only the hash values that are duplicates
cat checksumAndFilename.tmp | awk '{print $1}' | uniq -D | uniq > checksum.tmp
- We have only the duplicated hash values. Let's match the hashes back with their filenames
grep -f checksum.tmp checksumAndFilename.tmp
- Write to the output file and the screen
tee duplicates.tmp
- Let's remind ourselves where the output lives
echo "output in \"duplicates.tmp\""
- Clean up our mess
rm checksumAndFilename.tmp checksum.tmp
checksum.tmp looks like this. These are only the hash values that are duplicated.
0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2
075a137c8857c8b38555cf632d906ed0581b9224
And duplicates.tmp, the final output, looks like this. The first two files are duplicates of each other, and the next two are as well.
0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2 ./2010-11-28-0926-IMG_0300.jpg
0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2 ./IMG_0300.jpg
075a137c8857c8b38555cf632d906ed0581b9224 ./2010-06-08-photoshoot012.jpg
075a137c8857c8b38555cf632d906ed0581b9224 ./2010-06-08-photoshoot_012.jpg
My gut tells me there are some ways to clean this script up. Please add a comment if you can improve the script.
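For what it's worth, here are two small tweaks that should behave the same, though I have not re-run the whole pipeline to verify: uniq -d already prints one line per duplicated value, so it can replace the uniq -D | uniq pair, and grep -F treats the hashes as plain strings instead of regular expressions, which is safer and a little faster.
# uniq -d prints each duplicated hash once, replacing "uniq -D | uniq"
awk '{print $1}' checksumAndFilename.tmp | uniq -d > checksum.tmp
# -F matches the hashes as literal strings rather than regular expressions
grep -F -f checksum.tmp checksumAndFilename.tmp | tee duplicates.tmp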
1 comment:
To cut down on the size of the final hash you can add "| cut -c 60- " right before writing to the duplicates file.
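Along the same lines, if you want to drop the hashes entirely and keep only the file names, shasum prints the hash followed by two separator characters, so with SHA-256 (64 hex characters) the names should start at column 67; double-check the offset for your hash length:
# SHA-256 hashes are 64 characters plus 2 separator characters, so file names start at column 67
grep -f checksum.tmp checksumAndFilename.tmp | cut -c 67- | tee duplicates.tmp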