Monday, July 4, 2011

How to remove duplicate files without wasting time

Duplicate files can end up on your computer in many ways. No matter how they got there, they should be removed as soon as possible. Waste is waste: why should you tolerate it? It’s not just a matter of principle: duplicates make your backups, not to mention indexing with Nepomuk or similar engines, take more time than is really necessary. So let’s get rid of them.

First, let’s find which files are duplicates

Whenever I want to find and remove duplicate files automatically, I run two scripts in sequence. The first one is the script that actually finds which files are copies of each other. For this task I use this small gem by J. Elonen, pasted here for your convenience:
  #! /bin/bash
  OUTF=rem-duplicates.sh;
  echo "#! /bin/sh" > $OUTF;
  echo ""                >> $OUTF;
  find "$@" -type f -print0 | xargs -0 -n1 md5sum | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate | sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
  chmod a+x $OUTF
In this script, which I call find_dupes.sh, all the real black magic happens in the long find pipeline. The original page explains all the details, but here is, in synthesis, what happens: first, find and xargs run md5sum to calculate the MD5 checksum of every file in the folders passed as arguments to the script. Next, sort and uniq extract all the entries that share a checksum (and are, therefore, copies of the same file), and sed turns them into a sequence of shell commands to remove them. Several options inside the script, explained in the original page, make sure that things will work even if you have file names with spaces or non-ASCII characters. The result is something like this (from a test run made on purpose for this article):
  [marco@polaris ~]$ find_dupes.sh /home/master_backups/rule /tmp/rule/
  [marco@polaris ~]$ more rem-duplicates.sh
  #! /bin/sh
  #rm /home/master_backups/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
  #rm /tmp/rule/bis/rule_new/old/RULE/public_html/en/test/makefile.pl
  #rm /tmp/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
  #rm /tmp/rule/zzz/rule_new/old/RULE/public_html/en/test/makefile.pl
  #all other duplicates...
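If you want to see what each stage contributes before everything lands in rem-duplicates.sh, here is the same pipeline from find_dupes.sh, split into one stage per line with comments; in this form the output simply scrolls by on the terminal instead of being appended to $OUTF:
  # The pipeline from find_dupes.sh, one stage per line ("$@" holds the folders to scan):
  find "$@" -type f -print0 |                  # list every file in the given folders, NUL-separated
      xargs -0 -n1 md5sum |                    # print one "checksum  path" line per file
      sort --key=1,32 |                        # sort the lines so that identical checksums end up adjacent
      uniq -w 32 -d --all-repeated=separate |  # keep only repeated checksums, one blank line between groups
      sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/'
      # sed: drop the checksum, backslash-escape special characters, prefix each path with "#rm "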
As you can see from the rem-duplicates.sh listing above, the script does find the duplicates (in this sample there are four copies of makefile.pl, each in a different folder) but lets you decide which one to keep and which ones to remove, that is, which lines you should manually uncomment before executing rem-duplicates.sh. This manual editing can consume so much time that you’ll feel like throwing the computer out of the window and going fishing.
Luckily, at least in my experience, this is almost never necessary. In practically all the cases in which I have had to find and remove duplicates so far, there were always:
  • one original folder (“/home/master_backups/” in this example) whose content should remain untouched;
  • many unnecessary copies scattered over other, more or less temporary folders and subfolders (which, in our exercise, are all inside /tmp/rule/).
If that’s the case, it is no problem to massage the output of the first script into another one that leaves the first copy, the one in the master folder, alone and removes all the others. There are many ways to do this. Years ago I put together these few lines of Perl to do it, and they have served me well, but you’re welcome to suggest your preferred alternative in the comments:
  1  #! /usr/bin/perl
  2
  3  use strict;
  4  undef $/;                                # "slurp" mode: read the whole input at once
  5  my $ALL = <>;
  6  my @BLOCKS = split (/\n\n/, $ALL);       # one block per group of identical files
  7
  8  foreach my $BLOCKS (@BLOCKS) {
  9      my @I_FILE = split (/\n/, $BLOCKS);  # one array element per line of the block
  10     my $I;
  11     for ($I = 1; $I <= $#I_FILE; $I++) { # start at 1: element 0, the first copy, stays commented out
  12         substr($I_FILE[$I], 0, 1) = '    ';  # replace the leading '#' with white space
  13     }
  14     print join("\n", @I_FILE), "\n\n";
  15 }
This code puts all the text received from standard input into $ALL, then splits it into @BLOCKS, using two consecutive newlines as the block separator (line 6). Each block is then split into an array of single lines (@I_FILE in line 9). Next, the first character of all but the first element of that array (which, if you’ve been paying attention, is the shell comment character, ‘#’) is replaced by four white spaces. One would be enough, but code indentation is nice, isn’t it?
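If Perl is not your thing, the same transformation also fits in an awk one-liner. This is only a sketch of one possible alternative, not the script I actually use: within every blank-line-separated block it keeps the first (commented) line as it is and replaces the leading ‘#’ of all the following ones with spaces:
  awk 'NF == 0 { n = 0 }                    # a blank line closes the current block, reset the counter
       NF > 0 && n++ { sub(/^#/, "    ") }  # uncomment every line of a block except its first one
       { print }' rem-duplicates.sh > remove_copies.sh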
When you run this second script (I call it dup_selector.pl) on the output of the first one, here’s what you get:
  [marco@polaris ~]$ ./dup_selector.pl rem-duplicates.sh > remove_copies.sh
  [marco@polaris ~]$ more remove_copies.sh
  #! /bin/sh
  #rm /home/master_backups/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
       rm /tmp/rule/bis/rule_new/old/RULE/public_html/en/test/makefile.pl
       rm /tmp/rule/rule_new/old/RULE/public_html/en/test/makefile.pl
       rm /tmp/rule/zzz/rule_new/old/RULE/public_html/en/test/makefile.pl
  ....
Which is exactly what we wanted, right? If the master folder doesn’t have a name that makes it sort first within each group, you can temporarily rename it to something that does, like /home/0 (the recap right below shows the whole sequence). What’s left? Oh yes, cleaning up! After you’ve executed remove_copies.sh, /tmp/rule will contain plenty of empty directories, which you’ll want to remove before browsing what’s left with your file manager, so you don’t waste time looking inside empty boxes.
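To recap, here is what a whole session might look like, with hypothetical folder names and with the temporary rename thrown in (you only need it when the master folder would not sort first on its own):
  mv /home/master_backups /home/0_backups                  # temporary rename, so the master copies sort first in each group
  find_dupes.sh /home/0_backups /tmp/rule/                 # step 1: write the groups of duplicates to rem-duplicates.sh
  ./dup_selector.pl rem-duplicates.sh > remove_copies.sh   # step 2: uncomment every copy except the first one
  more remove_copies.sh                                    # always look before you leap...
  sh remove_copies.sh                                      # ...then actually remove the extra copies
  mv /home/0_backups /home/master_backups                  # give the master folder its original name back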

How to find and remove empty directories

Several websites suggest some variant of this command to find and remove all the empty subdirectories:
find -depth -type d -empty -exec rmdir {} \;
This walks the folder hierarchy bottom-up (-depth, so that directories whose only content is other empty directories get emptied and removed too), finds all the objects that are directories AND are empty (-type d -empty), and executes the rmdir command on each of them. It works, and since -exec hands each name directly to rmdir, even spaces or other weird characters in directory names are not a problem; what it does not give you, though, is a chance to review the list of directories before they are actually removed. That’s why I tend to use a slightly more complicated command for this purpose:
  [marco@polaris ~]$ find . -depth -type d -empty | while IFS= read -r line ; do echo "rmdir '$line'" ; done > rmdirs.sh
  [marco@polaris ~]$ cat rmdirs.sh
  rmdir 'rule/slinky_linux_v0.3.97b-vumbox/images'
  rmdir 'rule/slinky_linux_v0.3.97b-vumbox/RedHat/RPMS'
  ...
  [marco@polaris ~]$ source rmdirs.sh
Using the while loop creates a command file (rmdirs.sh) that wraps each directory name in single quotes, so that rmdir always receives one single argument (the IFS= read -r part simply makes sure that read does not trim leading spaces or eat backslashes along the way). This always works… with the obvious exception of names that contain single quotes! Dealing with those properly requires some shell quoting tricks that… we’ll cover in another post!
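If you really can’t wait for that post, here is a sketch of one bash-specific workaround: printf with the %q format prints its argument quoted in such a way that the shell will read it back verbatim, spaces, single quotes and all:
  # Same loop as above, but the quoting is delegated to bash's printf %q:
  find . -depth -type d -empty | while IFS= read -r line ; do
      printf 'rmdir %q\n' "$line"             # %q escapes spaces, quotes and other special characters
  done > rmdirs.sh
For now, you know that whenever you have duplicate files to remove quickly, you can do it by running the two scripts shown here in sequence. Have fun!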
