

Showing posts from March, 2009

Finding Duplicate MP3s Using Locality Sensitive Hashing

I have a rather large collection of music that I've built up over the years. Properly organizing it and finding duplicates is time-consuming. Over the weekend, I spent some time working on scripts to find duplicates. My first iteration is below. It's basically just an implementation of fdupes (apt-get install fdupes) that ignores the ID3 header in MP3 files. It uses Mutagen to handle the ID3 stuff.

    import hashlib
    import os
    import sys

    import mutagen
    from mutagen.id3 import ID3

    def get_mp3_digest(path):
        id3_size = ID3(path).size
        fp = open(path)
        # Ignore the ID3 header.
        digest = hashlib.md5(
        fp.close()
        return digest

    mp3s = {}
    top = sys.argv[1]
    for root, dirs, files in os.walk(top):
        for f in files:
            if f.endswith('.mp3'):
                path = os.path.join(root, f)
                try:
                    digest = get_mp3_digest(path)
                except mutagen.id3.error, e:
                    print >>sys.stderr, 'Error generating digest for %r.\n%r
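For illustration, here is a minimal Python 3 sketch of the same idea: hash each MP3 after skipping its leading ID3v2 tag, then group files by digest. It is not the original script. Instead of Mutagen it reads the 10-byte ID3v2 header directly (an assumption made to keep the sketch free of dependencies), it does not handle a trailing ID3v1 tag, and the names id3v2_size, audio_digest, and find_duplicates are hypothetical.

```python
import hashlib
import os
import sys
from collections import defaultdict


def id3v2_size(header):
    """Return the total size in bytes of a leading ID3v2 tag, or 0 if absent.

    Hypothetical helper: stands in for Mutagen's ID3(path).size.
    """
    if len(header) < 10 or header[:3] != b"ID3":
        return 0
    # The tag size is a "synchsafe" integer: only 7 bits of each byte count.
    size = 0
    for byte in header[6:10]:
        size = (size << 7) | (byte & 0x7F)
    # 10-byte header, plus a 10-byte footer when flag bit 0x10 is set.
    footer = 10 if header[5] & 0x10 else 0
    return 10 + size + footer


def audio_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of everything after the leading ID3v2 tag."""
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        # Skip the tag so that differing metadata does not change the digest.
        fp.seek(id3v2_size(fp.read(10)))
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_duplicates(top):
    """Walk `top` and return lists of paths whose audio bytes are identical."""
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(top):
        for name in files:
            if name.lower().endswith(".mp3"):
                path = os.path.join(root, name)
                try:
                    groups[audio_digest(path)].append(path)
                except OSError as e:
                    print("Could not hash %r: %r" % (path, e), file=sys.stderr)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


if __name__ == "__main__" and len(sys.argv) > 1:
    for paths in find_duplicates(sys.argv[1]):
        print("\n".join(paths))
        print()
```

Run against a music directory, this would print each group of byte-identical recordings separated by a blank line. Like fdupes, it only catches exact copies of the audio data; two different encodings of the same song will still hash differently.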