Levenshtein as a Fuzzy Match in Python

Previously, I have written about using the Levenshtein algorithm as a way to produce a fuzzy match between two strings in Java, and posted a Ruby implementation of Levenshtein to Guthub. Now it's Python's turn, but thankfully there is an excellent library for Levenshtein already written for Python. The only missing piece was that I wanted to have the result as a percentage instead of a count of edits between the two strings.

Before getting started using the python-Levenshtein, you'll need to install the dependency. For Ubuntu, this is accomplished with sudo apt-get install python-Levenshtein. After the dependency is resolved, add import Levenshtein to your Python code and then it's off to the races matching strings with Levenshtein.

The translation from number of edits to a percentage is the same I used in the Java version (http://georgestragand.com/levenshtein.html) and appears on line six of the full listing below, and is pseudo-code: 1.0 ( distance(str1, str2) / max(len(str1), len(str2) )

The data set used is the same as the Distribution of JSON Values With Python from the GunStockMarket and is available here.

Saving the code below into a file named levenshtein_example.py and the sample data into a file named GunStockMarket.json, it may be invoked with: python levenshtein_example.py < GunStockMarketSample.json ¦ more

The output should match (of course there is many more lines output if the program is allowed to execute for the entire data set):

0.82: FS: Springfield Armory M1A 308::FS: Springfield Armory M1A Loaded
0.81: FS: Springfield Armory M1A 308::FS: Springfield Armory M1 GARAND
0.83: FS: Springfield Armory M1A 308::SPringfield Armory M1A 308
0.88: FS: Springfield Armory M1A 308::FS: Springfield Armory M1A #9826
0.84: FS: Springfield Armory M1A 308::FS: Springfield Armory M1A #9222
0.81: FS/FT: M1 Garand::FS: M1 Garand
0.81: FS/FT: M1 Garand::FS: M1 Garand
0.81: FS/FT: M1 Garand::FS: M1 Garand
0.80: FS/FT: M1 Garand::FS/FT: H&R M1 Garand
0.81: FS/FT: M1 Garand::FS: M1 Garand
0.75: FS/FT: M1 Garand::FS: M1 garand
0.75: FS/FT: M1 Garand::FS: M1 garand
0.77: FS: Mosin Nagant M91/30::FS: Mosin Nagant M91/30 Soviet
0.81: FS: Mosin Nagant M91/30::FS/FT: mosin nagant M91/30
0.96: FS: Mosin Nagant M91/30::FS: Mosin nagant M91/30
0.96: FS: Mosin Nagant M91/30::FS: Mosin nagant M91/30
0.76: Soviet Mosin Nagant M91/30 762x54R::Mosin Nagant M91/30 7.62x54R
0.76: Soviet Mosin Nagant M91/30 762x54R::Mosin Nagant M91/30 7.62x54R
0.77: FS/FT: Mosin Nagant M38::FS/FT: mosin nagant M91/30

The Python code is:

import sys
import json
import Levenshtein

def compare(first, second):
    return 1.0 - (float(Levenshtein.distance(first, second)) / float(max(len(first), len(second))))

titles = []
for line in sys.stdin:
    event = json.loads(line)
    titles.append(event["title"].encode('utf-8'))
for title1 in titles:
    for title2 in titles:
        result = compare(title1, title2)
        if result >= 0.75 and title1 <> title2:
                print ("%0.2f" % result) + ": " + title1 + "::" + title2