# Levenshtein Fuzzy String Match

Previously I've posted about using the Dice Coefficient as a method to perform a fuzzy match on a string, producing a percentage as a result to measure the amount of similarity between two strings. The Dice Coefficient is a quick way to produce a measurement of similarity, but it does have some drawbacks. With small strings, the amount of similarity is small compared to larger strings. For example "cat" and "hat" come out with a rating of 50% using the Dice Coefficient. With longer strings, the "error" is smoothed. One benefit of the Dice Coefficient is that small runs of characters within the string (and by "small run", I really mean two characters given how the Dice Coefficient functions on bigrams) weigh into the overall rating of similarity. As good as the Dice Coefficient is, simple and fairly quick, it's not the tool for every job.

When dealing with small strings (and small is a relative measurement), the approach created by Vladimir Levenshtein in 1965 has a large benefit
over the Dice Coefficient approach. Amazing to realize that the 1965 wasn't a typo; this isn't a new idea or a new problem, but how many software
developers in the 21^{st} Century have heard of Levenshteins' algorithm? I'd guess only a small percentage! Yet this is a classic computer science
problem, and one that has applications outside of spell checking.

What Levenshteins' algorithm does is count the number of changes that must occur in one string to transform it into another string. Take the example of two strings, "George" and "Geordie"; how many characters would you have to change to transform "George" into "Geordie"? The answer is 2, here's how:

- George -> Georde (substitute the 'g' with 'd')
- Georde -> Geordie (insert 'i' between the 'd' and 'e')

Pretty easy stuff for a human, getting a machine to calculate the answer is another problem. Tracking and measuring the inserting is the really interesting part of the problem and this is what Levenshteins' algorithm does very nicely.

The actual algorithm itself is very simple, requiring a matrix to represent the values as the calculation progresses through both strings. A
simple version of the loops results in an O(n^{2}) implementation, but since only very small strings (under 10 characters typically) existed in
the problem space I was working within, this was not a huge performance drain.

One modification of the results of the algorithm that I made to produce the goal I had was to produce a percentage instead of a number. This is really only minor tweak to the results to transform the number of edits required into a percentage. To transform the results, I subtracted from 1.0 the number of edits required divided by the length of the longest string. Taking the resulting double value and multiplying by 100 yields a percentage; this step was only done to produce a result that would be meaningful to a user. Using the example of transforming "George" into "Geordie" the result would be: 1.0 * (2 / 7) or 71 (rounded down, remember the goal I had was to produce a fuzzy result and that 71 is close enough). Another example would be transforming "Fred" into "George", almost nothing is the same and 5 edits are required so the result would be 0, 1.0 * (5/6) or 16% as very little is the same and what is isn't in the same position.

Before getting into the loop to calculate the edit distance, check to see if either string is empty. If one is, return the length of the other string. This points out one interesting note about Levelshteins' algorithm, in that the result is always bounded by the length of the longest of the two input parameter strings.

If neither string is empty, then construct the matrix for the calculation. The matrix should be the length of the first input string plus one by the length of the second input string plus one (in the "George" vs. "Geordie" example above, the matrix would measure 7 by 8). Note the "plus one", as these rows are the "seed" values to kick everything off and allow a very simple comparison later. The first column and row will not be modified later as the loops start at the column and row indexed by one, not zero.

Fig 1: Initial Matrix

While looping through the strings, the indexes of both strings are the coordinates of the matrix. Note that the string indexes and the matrix will be off by one as the matrix coordinates start at one and, in the Java code below, the strings start at zero. The actual calculation is interesting as it looks at three cells, the cell to the left, the cell above and the cell to the upper left. The point of the three comparisons is to take the minimum value from any of the three cells, this can be represented simply in Java with a single line of code:

Math.min(Math.min(a, b), c);

Once the matrix is initialized, the remainder of the values may be filled in. Once the entire matrix is filled in, the answer is in the lower right hand cell.

For each comparison, either 0 or 1 is the result. If the two characters are the same, then the result is 0; if they are different, the result is 1. This is used in generating one of the inputs to the selection of the minimum value of the three candidates that will result in the value inserted into the next cell of the matrix. The result of comparing the 'G' to the 'G' would then be 0. The cell above has a value of 1, the cell to the left has a value of 1, and the value of the cell one up and one to the left plus the result of the comparison is 0. Feeding 1,1,0 into the three way minimum results in 0, which is then placed into the cell resulting in a matrix with these values:

Figure 2: Matrix after the first iteration

The result of the first pass would then be:

Figure 3: Matrix after the first pass

Letting the loops run to completion results in a matrix that looks like:

Figure 4: The completed matrix

And the result is 2, as predicted originally.

# The Java Code

public class Levenshtein { public Levenshtein() { super(); } public double compare(final String s1, final String s2) { double retval = 0.0; final int n = s1.length(); final int m = s2.length(); if (0 == n) { retval = m; } else if (0 == m) { retval = n; } else { retval = 1.0 - (compare(s1, n, s2, m) / (Math.max(n, m))); } return retval; } private double compare(final String s1, final int n, final String s2, final int m) { int matrix[][] = new int[n + 1][m + 1]; for (int i = 0; i <= n; i++) { matrix[i][0] = i; } for (int i = 0; i <= m; i++) { matrix[0][i] = i; } for (int i = 1; i <= n; i++) { int s1i = s1.codePointAt(i - 1); for (int j = 1; j <= m; j++) { int s2j = s2.codePointAt(j - 1); final int cost = s1i == s2j ? 0 : 1; matrix[i][j] = min3(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + cost); } } return matrix[n][m]; } private int min3(final int a, final int b, final int c) { return Math.min(Math.min(a, b), c); } }