Simple Fuzzy String Similarity in Java

Java provides an equals function on the java.lang.String class, it works fine and there is no need to mess with it. But what if you need to know if one string is "close" to another but not equals? Java does not have a built in "closeness" measurement. Close has always counted in the game of horseshoes, and it could be argued that close will count with hand grenades or nuclear warfare, but out of the box programming languages do not come with a closeness measurement as it's either equal or not equal (OK, leave out equalsIgnoreCase because upper/lower case is not the measurement discussed here); there is no way to say "it's 75% the same" with stock methods. This absence is noteworthy, especially since Java comes with toys like regular expressions (java.util.regex.Pattern, etc.). If you need a fuzzy string match, you're up the creek.

Thankfully, there has been a ton of development in this area and the approaches are well documented. Bridging the gap from a formula to code can be interesting. Applying a simple concept from fuzzy logic where 1.0 is true and 0.0 is false, it's possible to translate that concept to 1.0 meaning the strings are exactly equal (equality expressed as the same bigram set, explained below) and 0.0 meaning the strings have nothing in common.

One very simple way to accomplish a measurement of string closeness is to use Dice's coefficient. The formula is:

Dice fourmula

One additional idea is the concept of breaking a string down to bigrams. A bigram is two adjacent characters from a string. Taking the string "fubar", the set of bigrams would be {"fu", "ub", "ba", "ar"}. Likewise, "foobar" would break down into {"fo", "oo", "ob", "ba", "ar"}. With either "fubar" or "foobar" you are still up the creek without a paddle, but are they almost the same? Now that the bigrams have been created, the formula can be filled in.

A quick review of the symbols in the formula may be handy. Remember that means the cardinality of the set X. If X = {"fu", "ub", "ba", "ar"}, then |X| = 4. The symbol between is the intersection of the sets on the left and right, resulting in a set of common members in each original set. In the classic Venn diagram of two circles, the union represents the common area in both circles.

The intersection of the bigram set from "fubar" and "foobar" is: {"ba", "ar"}. Only two elements exist in this set, which gives 4/9 as the measurement of similarity between "fubar" and "foobar". Where did the 9 come from? There are 4 elements in the bigram set of "fubar" and 5 elements in the bigram set of "foobar".

A simple (remember, simple, not efficient or concerned about Unicode) bigram generator could be represented with a method such as:

public List<char[]> bigram(String input)
{
	ArrayList<char[]> bigram = new ArrayList<char[]>();
	for (int i = 0; i < input.length() - 1; i++)
	{
		char[] chars = new char[2];
		chars[0] = input.charAt(i);
		chars[1] = input.charAt(i+1);
		bigram.add(chars);
	}
	return bigram;
}

Given that there is a method to convert a String to a bigram set, a simple method to produce the measurement of similarity could be:

public double dice(List<char[]> bigram1, List<char[]> bigram2)
{
	List<char[]> copy = new ArrayList<char[]>(bigram2);
	int matches = 0;
	for (int I = bigram1.size(); --i >= 0;)
	{
		char[] bigram = bigram1.get(i);
		for (int j = copy.size(); --j >= 0;)
		{
			char[] toMatch = copy.get(j);
			if (bigram[0] == toMatch[0] && bigram[1] == toMatch[1])
			{
				copy.remove(j);
				matches += 2;
				break;
			}
		}
	}
	return (double) matches / (bigram1.size() + bigram2.size());
}

The above code is a very basic and straightforward implementation; the only tricks are to make a copy of one of the input bigrams to not modify what is passed in so that as matches are made the bigram is removed from any further matching, and the increment of the local variable match by two each time a match is made instead of multiplying the local variable match by 2 at the end. The second part (increment of match) was just done in the example to see if anyone was paying attention.

Those methods above, bigram and dice, provide a suggestion to implement Dice's coefficient in Java to create a simple measurement of a fuzzy string similarity.