Wednesday, May 04, 2022

How to Find Duplicate Images on a Mac or Linux Machine

This is how to recursively search all directories on a Mac or Linux machine for duplicate images with a single line of awkward bash script. The method finds duplicates anywhere on your disk below the current directory (try "~"), and it lists every copy of a duplicated file, not just the first pair.

Many commercial products exist to easily find and delete duplicate images, like those reviewed here, but if you are like me, don't like to download apps willy-nilly for a single task, and have a bit of shell scripting experience, you can use the following line of tortured bash script to find duplicate files.

find . -type f \( -name "*.jpg" -o -name "*.gif" \) | awk '{print "\"" $0 "\""}' | xargs shasum -a 256 | sort > checksumAndFilename.tmp && cat checksumAndFilename.tmp | awk '{print $1}' | uniq -D | uniq > checksum.tmp && grep -f checksum.tmp checksumAndFilename.tmp | tee duplicates.tmp && echo "output in \"duplicates.tmp\"" && rm checksumAndFilename.tmp checksum.tmp

The basic idea is to search the current directory and all subdirectories for image files, calculate a hash for each file, sort by hash so that identical files end up next to each other, and then list the files whose hashes appear more than once. (If this is too much for your brain, just go to imymac and buy an app.)
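For reference, here is the same command split across lines so the pieces are easier to see; bash continues a command onto the next line when a line ends with "|" or "&&", so this should behave identically to the one-liner above. The numbers in the comments match the steps below.

    find . -type f \( -name "*.jpg" -o -name "*.gif" \) |   # 1. find the image files
      awk '{print "\"" $0 "\""}' |                           # 2. quote names that contain spaces
      xargs shasum -a 256 |                                   # 3. hash every file
      sort > checksumAndFilename.tmp &&                       # 4. sort so duplicate hashes sit together
    cat checksumAndFilename.tmp | awk '{print $1}' |
      uniq -D | uniq > checksum.tmp &&                        # 5-6. keep only the repeated hashes
    grep -f checksum.tmp checksumAndFilename.tmp |            # 7. match hashes back to filenames
      tee duplicates.tmp &&                                   # 8-9. print and save the duplicates
    echo "output in \"duplicates.tmp\"" &&                    # 10. say where the output lives
    rm checksumAndFilename.tmp checksum.tmp                   # 11. clean up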

Ok, let's go through the command in detail.

  1. Get all the image files in your directory and below. Update "*.jpg" to "*.png" or whatever you need.

    find . -type f \( -name "*.jpg" -o -name "*.gif" \)

  2. Surround each file name with double quotes, since some people still insist on the horrible, dastardly, awful practice of including spaces in names.

    awk '{print "\"" $0 "\""}'

  3. Pipe the file names into shasum to generate a SHA-256 hash for each file.

    xargs shasum -a 256

  4. Sort by the hash value so duplicates will be adjacent and write to a temp file

    sort > checksumAndFilename.tmp

    checksumAndFilename.tmp looks like this. The files with the same hash value would be duplicates.

    ff45b77226369d27b67772e72dfe8dc3387eff06  ./2010-07-04-2224-July4_036.jpg
    ff65e3611973092e61127439af6b3c82d0ee055a  ./2010-12-29-1408-IMG_9638.jpg
    ff680170b0451868a1bda027c801b78f55067366  ./2010-12-24-1010-IMG_9235.jpg
    ff918f6f8230deb3cd2208602dadb5c6f88039dc  ./2010-03-14-2025-IPhone_8146.jpg
    

    We are almost done, but how do we see only the hash values that appear more than once?

  5. Get only the hash values that are duplicates. Since the file is already sorted, uniq -D prints every hash that occurs more than once, and the trailing uniq collapses each group down to a single copy.

    cat checksumAndFilename.tmp | awk '{print $1}' | uniq -D | uniq > checksum.tmp

  6. checksum.tmp looks like this. These are only the hash values that are duplicated.

    0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2
    075a137c8857c8b38555cf632d906ed0581b9224
    
  7. We have only the duplicated hash values. Let's match the hashes back to their filenames.

    grep -f checksum.tmp checksumAndFilename.tmp

  8. We can see the first two files are duplicates of each other, and the next two are as well. (If you want to double-check a pair before deleting anything, see the cmp note after this list.)

      
    0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2  ./2010-11-28-0926-IMG_0300.jpg
    0526e5586cc1e4d2d97e5cc813c8d9b698bc3df2  ./IMG_0300.jpg
    075a137c8857c8b38555cf632d906ed0581b9224  ./2010-06-08-photoshoot012.jpg
    075a137c8857c8b38555cf632d906ed0581b9224  ./2010-06-08-photoshoot_012.jpg
    
  9. Write to the output file and the screen

    tee duplicates.tmp

  10. Let's remind ourselves where the output lives

    echo "output in \"duplicates.tmp\""

  11. Clean up our mess

    rm checksumAndFilename.tmp checksum.tmp
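A matching SHA-256 hash is, for all practical purposes, proof that two files are identical, but if you want to be absolutely certain before deleting anything, cmp compares two files byte for byte (using the pair of filenames from step 8 as an example):

    cmp -s ./2010-11-28-0926-IMG_0300.jpg ./IMG_0300.jpg && echo "identical" || echo "different"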

My gut tells me there are ways to clean this script up; one direction I'd explore is sketched below. Please add a comment if you can improve the script.
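This is only a sketch and lightly tested, but find can hand NUL-delimited names straight to xargs with -print0 and -0 (both work on macOS and Linux), which makes the quoting step unnecessary and copes with any filename, and a single awk pass can replace the temp files:

    find . -type f \( -name "*.jpg" -o -name "*.gif" \) -print0 |
      xargs -0 shasum -a 256 |       # hash every file; -0 handles spaces and other odd characters
      sort |                         # group identical hashes together
      awk '{ n = ++count[$1]; if (n == 2) print prev[$1]; if (n >= 2) print; prev[$1] = $0 }' |
      tee duplicates.tmp             # print the duplicates and save them to a file

The awk line remembers the first file it saw for each hash and prints it, along with every later file, as soon as that hash shows up a second time.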

1 comment:

Mitch Fincher said...

To cut down on the size of the final hash you can add "| cut -c 60- " right before writing to the duplicates file.