Skip to main content

Finding Duplicate MP3s Using Locality Sensitive Hashing

I have a rather large collection of music that I've built up over the years. Properly organizing it and finding duplicates is time consuming. Over the weekend, I spent some time working on scripts to find duplicates. My first iteration is below. It's basically just an implementation of fdupes (apt-get install fdupes) that ignores the ID3 header in MP3 files. It uses Mutagen to handle the ID3 stuff.
import hashlib
import os
import sys

import mutagen
from mutagen.id3 import ID3

def get_mp3_digest(path):
id3_size = ID3(path).size
fp = open(path) # Igore the ID3 header.
digest = hashlib.md5(
return digest

mp3s = {}
top = sys.argv[1]

for root, dirs, files in os.walk(top):
for f in files:
if f.endswith('.mp3'):
path = os.path.join(root, f)
digest = get_mp3_digest(path)
except mutagen.id3.error, e:
print >>sys.stderr, 'Error generating digest for %r.\n%r' % (path, e)
mp3s.setdefault(digest, []).append(path)
print >>sys.stderr, root

for digest, paths in mp3s.items():
if len(paths) > 1:
for path in paths:
print path
After that, I came across locality sensitive hashing (LSH) and decided it would be a nice way to speed up duplicate detection for non-exact duplicates. Here's my explanation of LSH in a nutshell. Pick some hash function that will result in collisions with similar values (i.e. concatenating a sampling of data from the value). For example, if your value was a list of four digits, pick the first two and concatenate them. That hash would have lots of collisions for similar values (any that share the same first two digits). Now, extend that hash function to include some randomness in the output. Instead of picking the first two integers in the list, pick a random two. If you hash the value many times, it creates a uniform distribution of hash keys based on the entire value you're hashing. So, similar values will often collide with each other. Once you've hashed all your values in this way, you can simply count up the collisions to find duplicates.

My script is below, but here's a basic explanation of how it works:
  • Convert the path into all lower case, strip all non-alpha characters, and concatenate it into one long string.
  • Pick a random sample of m letters from the resulting string and concatenate them into a hash key for the path.
  • Repeat n times per path.
  • For each path and for each generate hash key per path, count up the number of collisions with other paths in the hash table.
  • For any paths which collide more than t times, record the path as a duplicate.
This is a very fast process. If your duplicates have similar file names, you can quickly pare down the list of files to inspect further for better duplicate detection. I use the percent difference in audio length and then sort by bit rate quality. Here's the what I ended up with:
import os
import random
import re
import sys

import mutagen
from mutagen.mp3 import MP3

NUM_CHARS_TO_PICK = 50 # Number of chars to pick from each path (sample size).
NUM_BUCKETS_PER_PATH = 50 # Number of samples to pick from each path.
MATCH_THRESHOLD = 5 # Number of collisions required to indicate a duplicate.
PERCENT_DIFFERENCE_THRESHOLD = 0.025 # Threshold for matching length.

NONALPHA_RE = re.compile(r'[^a-z]')

def fuzzy_compare_mp3_length(path, other_path):
audio = MP3(path)
other_audio = MP3(other_path)
except mutagen.mp3.error, e:
logging.error('Error comparing quality of %r and %r.\n%r' %
(path, other_path, e))
return False
pd = abs( -
pd /= ( + / 2.0

def compare_mp3_quality(path, other_path):
audio = MP3(path)
other_audio = MP3(other_path)
except mutagen.mp3.error, e:
logging.error('Error comparing quality of %r and %r.\n%r' %
(path, other_path, e))
return 0
return -

def hash(s):
indexes = random.sample(xrange(len(s)), NUM_CHARS_TO_PICK)
return ''.join(s[i] for i in indexes)

buckets = {}
paths = {}

# First we fill buckets using a LSH.
for top in sys.argv[1:]:
for root, dirs, files in os.walk(top):
for f in files:
path = os.path.join(root, f)
name, ext = os.path.splitext(path[len(top):])
if ext != '.mp3':
letters = NONALPHA_RE.sub('', name.lower())
# We need to pad any strings that are too short.
letters += 'x' * (NUM_CHARS_TO_PICK - len(letters))
for n in range(NUM_BUCKETS_PER_PATH):
bucket_key = hash(letters)
buckets.setdefault(bucket_key, []).append(path)
paths.setdefault(path, []).append(bucket_key)
print >>sys.stderr, root

checked = []
for path, path_bucket_keys in paths.iteritems():
# Skip already checked paths. We assume that duplicate detection is
# communitive. If dupe(A, B) and dupe(B, C) then dupe(A, C).
if path in checked:
# Count up all the bucket collisions by path. This count includes our
# current path (it will exist once in all its own buckets)
collisions = {}
for bucket_key in path_bucket_keys:
for other_path in buckets[bucket_key]:
collisions[other_path] = collisions.get(other_path, 0) + 1
# All paths that collided more times than the threshold are duplicates.
dupes = set()
for other_path, count in collisions.iteritems():
# If we have dupes based on path similarity, sort them by quality and remove
# any dupes with significantly different lengths.
if len(dupes) > 1:
dupes = sorted(dupes, cmp=compare_mp3_quality, reverse=True)
for dupe in dupes[1:]:
if not fuzzy_compare_mp3_length(dupes[0], dupe):
# If we still have dupes, output them to the rm script.
if len(dupes) > 1:
print '# rm %r' % dupes[0]
for dupe in dupes[1:]:
print 'rm %r' % dupe
It was very snappy and worked quite well for me. There are several natural extensions of this that include looking at ID3 tags. It's possible I'll get around to that as well :)

Popular posts from this blog

Email Injection

Not so long ago, I ran a wiki called SecurePHP. On that wiki, there was one particular article about email injection that received a lot of attention. Naturally, with all the attention came lots of spam. As a result, I disabled editing of the wiki and content stagnated. Still, the email injection article remained popular. About a year later, the server that hosted SecurePHP died and I never had a chance to hook it all back up. I saved the article though and I'm reposting it now. It may be a bit old (I've been away from PHP for a long time), and I didn't write all of it, so feel free to leave comments about needed updates and corrections. Though this article focuses on PHP, it provides a lot of general information regarding email injection attacks. The PHP mail() Function There are a lot of ways to send anonymous emails, some use it to mass mail, some use it to spoof identity, and some (a few) use it to send email anonymously. Usually a web mailform using the mail() funct
Read more

Bot Commander r1 Released

I just published Bot Commander , the code for my Lego NXT rover . There's a lot left to be done, but release early and often, right? Currently it provides a UI for controlling the direction and speed of all three motor ports on the NXT brick. You can link motors together to adjust their speed in unison. In addition, you can enable "Tilt Control" for a steering-wheel-type experience. To use tilt control: Hook up motor A and B to be the left and right wheels of your vehicle. Hold the phone sideways (i.e. landscape). Tilt the phone forward and backward to drive forward and backward. Turn the phone right and left (like a steering wheel) to steer right and left. As you tilt the phone, you'll see the UI update the slider controls for the speed of motors A and B. I plan to expand the UI to provide a lot more than just motor control. Before that, though, I'll push a JAR to make it easy to integrate control of Lego NXT robots into your own Android project. The code
Read more

Android Recipes and Snippets

I've put together a small collection of Android recipes. For each of these recipes, this is an instance of Context (more specifically, Activity or Service ) unless otherwise noted. Enjoy :) Intents One of the coolest things about Android is Intents . The two most common uses of Intents are starting an Activity (open an email, contact, etc.) and starting an Activity for a result (scan a barcode, take a picture to attach to an email, etc.). Intents are specified primarily using action strings and URIs. Here are some things you can do with the android.intent.action.VIEW action and startActivity() . Intent intent = new Intent(Intent.ACTION_VIEW); // Choose a value for uri from the following. // Search Google Maps: geo:0,0?q=query // Show contacts: content://contacts/people // Show a URL: intent.setData(Uri.parse(uri)); intent.setFlags(Intent.FLAG_ACTIVITY_NEW_TASK); startActivity(intent); Other useful action/URI pairs include: Intent.ACTION_DIAL , tel://867530
Read more