Levenshtein as a Fuzzy Match in Python
Previously, I have written about using the Levenshtein algorithm as a way to produce a fuzzy match between two strings in Java, and posted a Ruby implementation of Levenshtein to Guthub. Now it's Python's turn, but thankfully there is an excellent library for Levenshtein already written for Python. The only missing piece was that I wanted to have the result as a percentage instead of a count of edits between the two strings.
Before getting started using the python-Levenshtein, you'll need to install the dependency. For Ubuntu, this is accomplished with sudo apt-get install python-Levenshtein
. After the dependency is resolved, add import Levenshtein
to your Python code and then it's off to the races matching strings with Levenshtein.
The translation from number of edits to a percentage is the same I used in the Java version (http://georgestragand.com/levenshtein.html) and appears on line six of the full listing below, and is pseudo-code: 1.0 ( distance(str1, str2) / max(len(str1), len(str2) )
The data set used is the same as the Distribution of JSON Values With Python from the GunStockMarket and is available here.
Saving the code below into a file named levenshtein_example.py and the sample data into a file named GunStockMarket.json, it may be invoked with: python levenshtein_example.py < GunStockMarketSample.json ¦ more
The output should match (of course there is many more lines output if the program is allowed to execute for the entire data set):
0.82: FS: Springfield Armory M1A 308::FS: Springfield Armory M1A Loaded 0.81: FS: Springfield Armory M1A 308::FS: Springfield Armory M1 GARAND 0.83: FS: Springfield Armory M1A 308::SPringfield Armory M1A 308 0.88: FS: Springfield Armory M1A 308::FS: Springfield Armory M1A #9826 0.84: FS: Springfield Armory M1A 308::FS: Springfield Armory M1A #9222 0.81: FS/FT: M1 Garand::FS: M1 Garand 0.81: FS/FT: M1 Garand::FS: M1 Garand 0.81: FS/FT: M1 Garand::FS: M1 Garand 0.80: FS/FT: M1 Garand::FS/FT: H&R M1 Garand 0.81: FS/FT: M1 Garand::FS: M1 Garand 0.75: FS/FT: M1 Garand::FS: M1 garand 0.75: FS/FT: M1 Garand::FS: M1 garand 0.77: FS: Mosin Nagant M91/30::FS: Mosin Nagant M91/30 Soviet 0.81: FS: Mosin Nagant M91/30::FS/FT: mosin nagant M91/30 0.96: FS: Mosin Nagant M91/30::FS: Mosin nagant M91/30 0.96: FS: Mosin Nagant M91/30::FS: Mosin nagant M91/30 0.76: Soviet Mosin Nagant M91/30 762x54R::Mosin Nagant M91/30 7.62x54R 0.76: Soviet Mosin Nagant M91/30 762x54R::Mosin Nagant M91/30 7.62x54R 0.77: FS/FT: Mosin Nagant M38::FS/FT: mosin nagant M91/30
The Python code is:
import sys import json import Levenshtein def compare(first, second): return 1.0 - (float(Levenshtein.distance(first, second)) / float(max(len(first), len(second)))) titles = [] for line in sys.stdin: event = json.loads(line) titles.append(event["title"].encode('utf-8')) for title1 in titles: for title2 in titles: result = compare(title1, title2) if result >= 0.75 and title1 <> title2: print ("%0.2f" % result) + ": " + title1 + "::" + title2