I’m cleaning up the giant unorganised mess of photos that I’ve accumulated over the years, and I felt my process was worth documenting because, judging by Stack Overflow questions, a lot of other people are trying to do the same thing.
My photos “album” currently looks something like this:
- Tens of thousands of images, going back many years, taken on many different devices.
- Not organised in any orderly fashion: some are in YYYY-MM directories, some are flat, and some are in “Holidays 2012” or similar.
- Filenames are whatever the camera gave me, so lots of IMG_0001.JPG and other names that look like they came from random number generators.
- Tons of duplication everywhere at different resolutions or quality settings as they got passed around through WhatsApp or Facebook or whatever.
Not a great place to be, so let’s fix things up a bit!
Step 1: Install the tools we need
We won’t need much here and all of these should be available in your package manager. Go ahead and install fdupes, findimagedupes, geeqie and exiftool.
Step 2: Duplicate to a working directory if possible
Because we’re going to be running some destructive commands that will rename and delete files, it makes sense to first copy everything to another directory that we can work safely in. That way, if we make any mistakes and delete anything we shouldn’t, we can still get back to where we started at any time. Once everything is organised and looks beautiful, we can move it all back to where it’s supposed to be and remove the redundant copy.
Please be sure to follow the next steps in the newly created directory so we don’t damage the originals!
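Here’s a minimal sketch of that copy step. The helper name make_working_copy and the example paths are hypothetical, and it assumes a cp that supports -a (both the GNU and BSD versions do) so that timestamps and permissions survive the copy.

```shell
# Hypothetical helper: copy everything under SRC into a separate working directory.
make_working_copy() {
  local src=${1:?usage: make_working_copy SRC WORK}
  local work=${2:?usage: make_working_copy SRC WORK}
  mkdir -p "$work" || return 1
  # "$src"/. copies the contents (hidden files included) rather than the directory itself;
  # -a preserves timestamps, permissions and symlinks.
  cp -a "$src"/. "$work"/
}

# Example (placeholder paths):
#   make_working_copy ~/Pictures ~/photos-work
```

Everything from here on would then be run against the working copy only.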
Step 3: Delete duplicate files
So as an easy first step, we can delete all the exact duplicates of files. That is to say, the files that share the same md5 and are byte-for-byte identical. fdupes is the answer here, and if we want to be aggressive we can run it like this:
Replace DIR with your directory. If you want to be a bit more careful and test it out first to see what gets flagged, drop the --delete and --noprompt flags and it will only list what it finds.
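As a sketch of both modes described above, assuming fdupes’ long options --recurse, --delete and --noprompt: the wrapper name dedupe_exact and its preview argument are hypothetical conveniences, not part of fdupes.

```shell
# Hypothetical wrapper around fdupes; "preview" as the second argument only lists matches.
dedupe_exact() {
  local dir=${1:?usage: dedupe_exact DIR [preview]}
  if [ "${2:-}" = "preview" ]; then
    # Careful mode: print each group of identical files, change nothing.
    fdupes --recurse "$dir"
  else
    # Aggressive mode: keep the first file in each group, delete the rest without asking.
    fdupes --recurse --delete --noprompt "$dir"
  fi
}

# Example: dedupe_exact DIR preview    and, once happy,    dedupe_exact DIR
```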
Step 4: Delete extremely similar images
So this is the more awkward part and where you might want to go your own way rather than follow my strategy. The issue is that we may have pictures which are visually exact duplicates of each other but have small differences in the files themselves, so fdupes doesn’t consider them duplicates. We also want to delete identical pictures that are at different resolutions, ideally keeping the largest one.
For finding very similar images we can use the findimagedupes tool. The very first step is to create a fingerprint database of all our images, as this will speed the process up a lot later.
Obviously you don’t need to put the results in /tmp/ if you don’t want to, but it works for me. Depending on how many pictures you have this might take a little while, so let’s be patient.
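A sketch of the fingerprinting step, assuming the Perl version of findimagedupes and its --recurse, --fingerprints and --no-compare options. The wrapper name build_fingerprints is hypothetical, and /tmp/fingerprints is only a default you can override.

```shell
# Hypothetical wrapper: fingerprint every image under DIR and cache the results in DB.
build_fingerprints() {
  local dir=${1:?usage: build_fingerprints DIR [DB]}
  local db=${2:-/tmp/fingerprints}
  # --no-compare: only compute and store fingerprints now, skip the duplicate search.
  findimagedupes --recurse --fingerprints="$db" --no-compare -- "$dir"
}

# Example: build_fingerprints DIR
```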
Whilst we are waiting for that command to run, here is my preferred method for finding and removing our near-duplicates:
- Run findimagedupes again, and have it rename the smaller unwanted files by appending .TODELETE on the end.
- At the same time, findimagedupes can output the results into a collection that we can open with Geeqie to look through, and give ourselves comfort we have only found duplicates.
- If we don’t want to delete a given file, we can rename it to drop the .TODELETE at the end.
- When we’re satisfied, we can delete everything with .TODELETE on the end.
findimagedupes also lets us give it a threshold for how similar the images are. We’ll use this to follow the above steps multiple times, gradually increasing the tolerance so we aren’t overwhelmed all at once.
So let’s go ahead and run this command first to create a collection we can inspect afterwards. No renaming yet, or Geeqie won’t display the files properly.
Thanks to the fingerprinting we’ve already done this should execute really quickly. Now fire up Geeqie and open up /tmp/dupes100 to see what we have here. Look closely because even at 100% there can still be matches where you want to keep both (like those touched up in Photoshop).
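A sketch of that first inspection pass, again assuming the Perl findimagedupes with its --threshold and --collection options. collect_dupes is a hypothetical wrapper; it names the collection after the threshold so that a later pass at 99 or 98 gets its own file.

```shell
# Hypothetical wrapper: find matches at PERCENT similarity and write a collection for Geeqie.
collect_dupes() {
  local dir=${1:?usage: collect_dupes DIR [PERCENT] [DB]}
  local percent=${2:-100}
  local db=${3:-/tmp/fingerprints}
  # Nothing is renamed here; the matches only go into /tmp/dupes<PERCENT>.
  findimagedupes --recurse --fingerprints="$db" \
    --threshold="${percent}%" --collection="/tmp/dupes${percent}" -- "$dir"
}

# Example: collect_dupes DIR 100
```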
So now that we have many pairs of matching photographs, how do we best decide which one to delete? Some apps let us just delete the “first” one and keep the second, but because we have pictures at different resolutions we don’t want to do that. There are multiple approaches we could take, but I took the simpler one and decided to keep the biggest file and delete the smaller ones. We can do this with our trusty findimagedupes tool as well.
findimagedupes allows us to give another shell script as an argument, which it will execute for each group of matched images. Go ahead and take this short script and save it somewhere you can find later:
#!/usr/bin/bash
# Included by findimagedupes, which calls VIEW once per group of matching images.
VIEW() {
    biggest=-1
    tokeep=""
    echo "Analysing New Group"
    # Find the biggest file in the group (du -b needs GNU coreutils)
    for f in "$@"; do
        size=$(du -b "$f" | cut -f1)
        echo "$f: size in bytes: $size"
        if [ "$size" -gt "$biggest" ]; then
            biggest="$size"
            tokeep="$f"
        fi
    done
    # Mark every file that is not the one to keep by renaming it
    for f in "$@"; do
        if [ "$f" != "$tokeep" ]; then
            mv -- "$f" "$f.TODELETE"
            echo "marked for deletion: $f"
        else
            echo "keeping: $f"
        fi
    done
}
The important part of the script is that it leaves us only the biggest file of each duplicate group and marks the smaller ones for deletion. The logic could be modified, though, to look at any number of metrics to determine which is the “best” file to keep.
So with this saved somewhere, let’s execute the following:
The final redirect to /tmp/dupelogs is handy: it can be interesting to look through the output and see why it chose to delete what it did.
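A sketch of that run, assuming the script is handed to findimagedupes through its --include-file option so that our VIEW() function is called for each group. mark_dupes and the script path in the example are hypothetical.

```shell
# Hypothetical wrapper: run the VIEW() script over each group of matches at PERCENT similarity.
mark_dupes() {
  local dir=${1:?usage: mark_dupes DIR SCRIPT [PERCENT] [DB]}
  local script=${2:?usage: mark_dupes DIR SCRIPT [PERCENT] [DB]}
  local percent=${3:-100}
  local db=${4:-/tmp/fingerprints}
  # --include-file pulls in our VIEW() definition; its echo output goes to stdout.
  findimagedupes --recurse --fingerprints="$db" --threshold="${percent}%" \
    --include-file="$script" -- "$dir"
}

# Example (placeholder script path), keeping the log for later:
#   mark_dupes DIR ~/keep-biggest.sh 100 > /tmp/dupelogs
```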
Now we have a last chance to rename some files back if we don’t want them to be deleted. You can run this command below on a directory to recursively rename them back:
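One way to do that, as a sketch: restore_marked is a hypothetical helper that uses find to strip the suffix from every marked file under a directory. Point it at a subdirectory to restore only part of the tree.

```shell
# Hypothetical helper: strip the .TODELETE suffix from every marked file under DIR.
restore_marked() {
  local dir=${1:?usage: restore_marked DIR}
  find "$dir" -type f -name '*.TODELETE' -exec sh -c '
    for f in "$@"; do
      mv -- "$f" "${f%.TODELETE}"   # a.jpg.TODELETE becomes a.jpg
    done' sh {} +
}

# Example: restore_marked "DIR/Holidays 2012"
```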
So now life should be good, and we can go ahead and delete them:
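A sketch of the clean-up, assuming a find that supports -delete (GNU and BSD both do); purge_marked is a hypothetical name. This is the irreversible step, so it’s worth running the same find with only -print first.

```shell
# Hypothetical helper: permanently remove every file under DIR that ends in .TODELETE.
purge_marked() {
  local dir=${1:?usage: purge_marked DIR}
  # -print lists each file as it is removed, so the run leaves a record.
  find "$dir" -type f -name '*.TODELETE' -print -delete
}

# Example: purge_marked DIR
```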
Now we can simply repeat the process with a 99% threshold, then 98% etc until we’re happy with the state of things. I actually stopped at 98% because 97% was starting to show some strange matches and I didn’t want to take any risks deleting more pictures.
And that’s it! The hardest part is over.
Step 5: Renaming pictures to their create time
So naturally filenames like IMG_0001.JPG aren’t going to be very helpful; we want our files to have sensible names like 2020-09-10 14.14.22.jpg. Fortunately there is a way to make exactly that happen.
Cameras store metadata for each image in the Exif format, which we can view using exiftool. Running exiftool on a recent photo and grepping for anything that looks like a date, I get something like this:
$ exiftool 2020-11-07_151553.jpg | egrep "2[0-9][0-9][0-9][:-][0-1][0-9]"
File Name : 2020-11-07_151553.jpg
File Modification Date/Time : 2020:11:07 15:15:54+08:00
File Access Date/Time : 2020:12:04 13:42:16+08:00
File Inode Change Date/Time : 2020:12:04 20:22:42+08:00
Modify Date : 2020:11:07 15:15:53
Date/Time Original : 2020:11:07 15:15:53
Create Date : 2020:11:07 15:15:53
GPS Date Stamp : 2020:11:07
Create Date : 2020:11:07 15:15:53.921607
Date/Time Original : 2020:11:07 15:15:53.921607
Modify Date : 2020:11:07 15:15:53.921607
GPS Date/Time : 2020:11:07 07:15:52Z
There’s quite a lot of stuff there, and in our case what we want is the CreateDate field, which represents when the photo was taken. To recursively rename all our files to when they were created we can run this command:
If there is a collision with another file that has the same timestamp then the %c in there will be 1, 2, 3 etc.
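A sketch of the rename, using exiftool’s pattern of writing a date tag into FileName. The wrapper rename_by_date is hypothetical, and this assumes the dashed %-c form of the copy number (giving -1, -2 and so on) and %le for a lower-case extension; adjust the format string to taste.

```shell
# Hypothetical wrapper: rename every file under DIR after one of its date tags.
rename_by_date() {
  local dir=${1:?usage: rename_by_date DIR [TAG]}
  local tag=${2:-CreateDate}
  # -d sets the date format; %% escapes exiftool's own filename codes inside it:
  #   %%-c adds "-1", "-2", ... only on a name collision, %%le is the lower-cased extension.
  exiftool "-FileName<${tag}" -d '%Y-%m-%d %H.%M.%S%%-c.%%le' -r "$dir"
}

# Example: rename_by_date DIR
```

The optional second argument lets the same helper be pointed at a different date tag when CreateDate is missing.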
One thing that might happen is that some files don’t have the CreateDate Exif field at all. I found the best way to solve that was to look at the files where it didn’t work, see what Exif data they did have, and use those fields instead. For example, for some Facebook pictures I had to use “-FileName<ProfileDateTime” instead. Another candidate to try is “-FileName<DateTimeOriginal”.
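To see which files were skipped and what they offer instead, something like the following might help. list_undated and show_dates are hypothetical helpers; the exiftool options they use (-if, -p, -time:all, -a, -G1, -s) are standard, but treat the exact combination as a starting point.

```shell
# Hypothetical helper: list files under DIR that have no CreateDate tag.
list_undated() {
  local dir=${1:?usage: list_undated DIR}
  # -if keeps only files matching the condition; -p prints one line per file from a template.
  exiftool -r -if 'not $CreateDate' -p '$Directory/$FileName' "$dir"
}

# Hypothetical helper: show every date/time tag a single file does have.
show_dates() {
  local file=${1:?usage: show_dates FILE}
  # -time:all selects all date/time tags, -a keeps duplicates,
  # -G1 shows the group each tag came from, -s prints tag names instead of descriptions.
  exiftool -time:all -a -G1 -s "$file"
}

# Example: list_undated DIR    then    show_dates "DIR/some file.jpg"
```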
Step 6: Putting pictures into month folders
This is a relatively trivial step but since we’ve come this far we might as well do it. Assuming now all our files have the right naming convention, we can just loop over each file and put them into a month folder. I have this script in my Camera Uploads folder that I run from time to time to clean things up. The find command will only look at files that begin with YYYY-MM, so it will safely exclude the directories already created as well as itself.
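#!/usr/bin/bash
# Move files named YYYY-MM... in the current directory into YYYY-MM folders.
find . -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9]-[0-9][0-9]*' |
while IFS= read -r f
do
    month="${f:2:7}"    # skip the leading "./" and keep "YYYY-MM"
    mkdir -p "$month"
    mv -- "$f" "$month/"
done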
And there we are. Thank you for coming with me on this journey and I hope you discovered something useful! :)