Monthly Archives: April 2014

Finding duplicates in a file system

Hi everyone,

I have had quite a hard time finding a reasonable method for detecting duplicate files in a Linux or OS X file system, especially since lessfs is not an option at the moment. In particular, copies of some files may have been renamed, which makes them harder to identify as duplicates.

This has quite often been the case when dealing with digital photographs. A collection from a smartphone may have been downloaded more than once, or preparing some wedding photo albums may have produced more copies on your SSD than you actually intended. Either way, afterwards it is usually good to make sure that only the copies you really want remain. Everything else, apart from deliberate backups, is just wasted space on the device.

Of course, there are some tools with a GUI for this, but in times of ubiquitous cloud solutions, it might feel good to have a simple, technical solution where it is easy to know exactly what is happening. So my idea was to write a tool using the shell and traditional command-line tools, going by file size first (as it is pretty fast to obtain) and comparing file contents only if necessary, i.e. when two files really have identical size. The tool should list files with identical content and suggest which of them to delete. The whole solution should be as efficient as possible.

The solution

Please note that the author takes no responsibility for any results of copying or using parts of or entire methods or program codes contained in this article. Anything the reader decides to do remains his sole responsibility. 

For the sake of simplicity, we assume our file names do not contain the section sign “§”, because the script uses it as a field separator. In my opinion this is not a big restriction: files with names containing this character are simply not taken into account, that’s all. Some other time one can think of a better solution. We also omit files named “.DS_Store”, as they are usually not what we are interested in. In general, we take into account all files in the directory tree below the current directory the script is executed in.

So here is the script I wrote:

#!/bin/sh
listfile=/tmp/sfd_$$_list.out
cmpfile=/tmp/sfd_$$_cmp.out
resfile=/tmp/sfd_$$_res.out
find . -type f -exec ls -l {} \; | grep -v § | grep -v DS_Store | sed 's/ \.\// §.\//g' | awk '{print $5 "§" $0}' | awk -F§ '{print $1 "§" $3}' | sort -n > $listfile
cat $listfile | perl -F§ -lane 'if ($F[0]!=$lastsize){$cnt=1}else{$cnt+=1;if($cnt==2){print "\n",$lastall} print join(" ",@F); print "if diff \"",$lastname,"\" \"",$F[1],"\" >/d$
sh $cmpfile | grep are\ IDENTICAL$ > $resfile
cat $resfile | awk -F§ '{print $0 "\nrm -i \"" $2 "\""}'
rm -f $listfile
rm -f $cmpfile
rm -f $resfile

As we can see, the files are sorted by size first, independent of their name, and their contents are compared only if necessary, i.e. when two files have the same size. This keeps the effort quite low. I have also experimented with computing checksums, e.g. MD5, but both files would still need to be read anyway, so I prefer the classical diff. Comparing filename extensions first might improve performance further. However, in some environments the filename extension gets changed or just omitted, e.g. when saving image files from a web browser. Thus, the solution proposed here is what I decided to use.
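
The size-first idea described above can also be sketched roughly as follows. This is not the script from this post, just a minimal illustration under some assumptions: file names contain no tab or newline characters, cmp -s is used instead of diff, and the function names list_sizes, first_match and find_dupes are made up for this example. Each file is compared with every distinct file of the same size seen so far, not only with its direct neighbour in the sorted list.

```shell
#!/bin/sh
# Sketch only: group files by size, then confirm duplicates with cmp.
# Assumption: file names contain no tab or newline characters.

# Print "size<TAB>path" for every regular file below directory $1,
# skipping files named .DS_Store, sorted numerically by size.
list_sizes() {
    find "$1" -type f ! -name .DS_Store -exec sh -c '
        for f do
            printf "%s\t%s\n" "$(wc -c < "$f" | tr -d " ")" "$f"
        done' sh {} + | sort -n
}

# Print the first path listed in file $2 whose content equals file $1.
# Returns non-zero if none matches.
first_match() {
    while IFS= read -r rep; do
        if cmp -s "$rep" "$1"; then
            printf '%s\n' "$rep"
            return 0
        fi
    done < "$2"
    return 1
}

# Print a comment line and an "rm -i" proposal for each duplicate below $1.
# The generated rm lines are only safe to paste if the names contain
# no double quotes, dollar signs or backticks.
find_dupes() {
    reps=$(mktemp) || return 1      # distinct files of the current size
    tab=$(printf '\t')
    prev_size=
    list_sizes "$1" | while IFS="$tab" read -r size path; do
        if [ "$size" != "$prev_size" ]; then
            : > "$reps"             # new size group: forget the old files
            prev_size=$size
        fi
        if orig=$(first_match "$path" "$reps"); then
            printf '# "%s" and "%s" are identical\n' "$orig" "$path"
            printf 'rm -i "%s"\n' "$path"
        else
            printf '%s\n' "$path" >> "$reps"
        fi
    done
    rm -f "$reps"
}
```

Calling find_dupes . in a photo directory would print one commented line and one rm -i proposal per duplicate found. A file whose size is unique is never opened for comparison, because its size group contains nothing to compare it with.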

Now, assume we have a photo directory where the files IMG_6413.JPG, IMG_6414.JPG and IMG_6476.JPG are identical. Having run the above script, we get the following results:

#./IMG_6413.JPG and §./IMG_6414.JPG§ are IDENTICAL
rm -i "./IMG_6414.JPG"
#./IMG_6414.JPG and §./IMG_6476.JPG§ are IDENTICAL
rm -i "./IMG_6476.JPG"

The identical files are clearly visible and can be easily validated. Now, if we accept the generated proposal of using the rm command, we can just copy the result and paste it into the shell. For safety reasons the user is prompted before every removal operation.

For those who absolutely know what they are doing, the option “-i” can be removed. Copying and pasting the generated proposal would then remove the files straight away. Be careful with this.

Enjoy!

 

Hello world!

Hi everybody,

So this is my blog. It will deal with anything related to computing, especially things I’m particularly keen on: databases, various programming techniques, Linux and OS X usage and related stuff. Have fun!