Skip to main content

Posts

Showing posts from March, 2009

Finding Duplicate MP3s Using Locality Sensitive Hashing

I have a rather large collection of music that I've built up over the years. Properly organizing it and finding duplicates is time consuming. Over the weekend, I spent some time working on scripts to find duplicates. My first iteration is below. It's basically just an implementation of fdupes (apt-get install fdupes) that ignores the ID3 header in MP3 files. It uses Mutagen to handle the ID3 stuff.import hashlib
import os
import sys

import mutagen
from mutagen.id3 import ID3


def get_mp3_digest(path):
id3_size = ID3(path).size
fp = open(path)
fp.seek(id3_size) # Igore the ID3 header.
digest = hashlib.md5(fp.read()).hexdigest()
fp.close()
return digest


mp3s = {}
top = sys.argv[1]

for root, dirs, files in os.walk(top):
for f in files:
if f.endswith('.mp3'):
path = os.path.join(root, f)
try:
digest = get_mp3_digest(path)
except mutagen.id3.error, e:
print >>sys.stderr, 'Error generating digest for %r.\n%r' % (path, e)
else…
Read more