Skip to main content


Showing posts from March, 2009

Finding Duplicate MP3s Using Locality Sensitive Hashing

I have a rather large collection of music that I've built up over the years. Properly organizing it and finding duplicates is time consuming. Over the weekend, I spent some time working on scripts to find duplicates. My first iteration is below. It's basically just an implementation of fdupes (apt-get install fdupes) that ignores the ID3 header in MP3 files. It uses Mutagen to handle the ID3 stuff.import hashlib
import os
import sys

import mutagen
from mutagen.id3 import ID3

def get_mp3_digest(path):
id3_size = ID3(path).size
fp = open(path) # Igore the ID3 header.
digest = hashlib.md5(
return digest

mp3s = {}
top = sys.argv[1]

for root, dirs, files in os.walk(top):
for f in files:
if f.endswith('.mp3'):
path = os.path.join(root, f)
digest = get_mp3_digest(path)
except mutagen.id3.error, e:
print >>sys.stderr, 'Error generating digest for %r.\n%r' % (path, e)
Read more