FuzzyWuzzy: How to Measure String Distance on Python


Python’s FuzzyWuzzy library is used for measuring the similarity between two strings. With this tutorial, you can learn all about it and start using it too.

Sometimes, we need to see whether two strings are the same. When comparing an entered password’s hash to the one stored in your login database, ‘similarity’ just won’t cut it.

Other times, however, things can get a bit… fuzzier.

If my customer’s name is Albert Thompson, but he pays with a Credit Card under the name of Albert G. Thompson, should I be calling the police to report fraud? Should “The Lord of the Rings II: The Two Towers” and “The Lord of the Rings 2: the 2 Towers” be treated as two completely separate books by a website? Are Austria and Australia really two different countries?

Okay, I may have gotten carried away with that last one, but you get the idea.

String distance measures

What we want is some function that measures how similar two strings are, but is robust to small changes. This problem is as common as it sounds: scientists have been coming up with solutions to it for a long while.

Jaccard Distance: a first approach

One of the most intuitive is the Jaccard distance. It is really a distance measure between any two sets, so to apply it to strings we treat each string as a set of characters. It has the following formula:

d_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
Jaccard’s set distance formula

That is, the number of elements that are in either set but not shared by both, divided by the total count of distinct elements.

For instance, given the strings “Albert” and “Alberto”, it will report a similarity of 85.7%, since they share 6 letters out of a total of 7.

However, this is not a measure specifically tailored for strings.

It will fail in many use cases, since it doesn’t really take ordering into account. For example, two anagrams like “rail safety” and “fairy tales” will always have a 100% match, even though those strings are quite different.
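To see the problem in code, here is a minimal sketch of a character-level Jaccard similarity (the jaccard_similarity helper is my own, not part of any library):

def jaccard_similarity(s1, s2):
    # Treat each string as a set of distinct characters.
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)

jaccard_similarity("Albert", "Alberto")           # 0.857...
jaccard_similarity("rail safety", "fairy tales")  # 1.0, despite the difference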

Levenshtein Distance

Invented by the Russian scientist Vladimir Levenshtein in the ’60s, this measure is a bit more intuitive: it counts how many edit operations are needed to transform one string u into another string v.

For this method, an edit operation is defined as one of:

  • Deleting a character.
  • Inserting one.
  • Replacing a character with another one.

The minimum number of these operations that need to be applied to u in order to turn it into v corresponds to the Levenshtein distance between those two strings.

It can be obtained recursively with this formula:

\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0, \\
  \min \begin{cases}
    \operatorname{lev}_{a,b}(i-1,j) + 1 \\
    \operatorname{lev}_{a,b}(i,j-1) + 1 \\
    \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
Levenshtein’s string distance formula

Here, i and j are indices to the last characters of the substrings of a and b being compared. The second term in the last expression is an indicator: it equals 1 if those characters are different, and 0 if they’re the same.
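Translated almost literally into Python, the recursion looks like this (a naive sketch for clarity; real implementations use dynamic programming instead of recomputing every subproblem):

def lev(a, b, i=None, j=None):
    # i and j are the lengths of the prefixes of a and b being compared.
    if i is None:
        i, j = len(a), len(b)
    if min(i, j) == 0:
        return max(i, j)
    return min(
        lev(a, b, i - 1, j) + 1,                       # deletion
        lev(a, b, i, j - 1) + 1,                       # insertion
        lev(a, b, i - 1, j - 1) + (a[i-1] != b[j-1]),  # substitution
    )

lev("Albert", "Alberto")  # 1: just insert the final "o"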

This is the measure Python’s FuzzyWuzzy library uses.

Using FuzzyWuzzy in Python

First of all, we will have to install the library. To do this, all we have to do is run the following command:

pip install fuzzywuzzy
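FuzzyWuzzy will also warn you about falling back to a slow pure-Python matcher unless the optional python-Levenshtein package is installed, so you may want to add it for speed:

pip install python-Levenshtein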

After that’s done, to obtain the similarity ratio between two strings we will need these two lines:

from fuzzywuzzy import fuzz
similarity = fuzz.ratio("hello","world")

You probably noticed I said ratio. The ratio method will always return a number between 0 and 100 (yeah, I’d have preferred it to be between 0 and 1, or call it a percentage, but to each their own).

It can be shown that the Levenshtein distance is at most the length of the longest string: replace all characters in the shorter one with the first part of the longer one, and then add the remaining ones.

In order to return a ratio, we can normalize the distance by dividing it by the longer string’s length. Otherwise, the number would fluctuate enormously for inputs of different sizes.

That’s what FuzzyWuzzy’s ratio method does, except it then multiplies that number by 100.
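Putting those two ideas together, a hand-rolled ratio could look like the sketch below (my own illustration of the normalization idea; FuzzyWuzzy’s actual implementation normalizes against the combined length of both strings, so its numbers run a bit higher):

def simple_ratio(s1, s2):
    # Iterative Levenshtein distance: dynamic programming over prefixes.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (c1 != c2))) # substitution
        prev = curr
    # Normalize by the longer string's length and scale to 0-100.
    return 100 * (1 - prev[-1] / max(len(s1), len(s2)))

simple_ratio("Albert", "Alberto")  # ~85.7 (fuzz.ratio gives 92 here)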

This ratio solves some of the previously mentioned problems:

fuzz.ratio("Albert Thompson", "Albert G. Thompson") #91%

fuzz.ratio("The Lord of the Rings II: The Two Towers",
"The Lord of the Rings 2: the 2 Towers") #88%

Even if it may bring a few new ones:

#88% for two different countries
fuzz.ratio("Austria","Australia")

#57% but it's the same country
fuzz.ratio("Czechia","Czech Republic")

Other FuzzyWuzzy methods

Python’s FuzzyWuzzy library provides us not only with the vanilla Levenshtein distance, but also with a few other methods we can make use of.

partial_ratio

The partial_ratio method computes the regular ratio between the shorter string and every substring of the longer one that has the shorter string’s length, and then returns the highest match.

For instance,

fuzz.partial_ratio("abc","a") == 
min([fuzz.ratio( char, "a") for char in "abc"])

This has some interesting effects:

fuzz.partial_ratio("Thomas and His Friends", "Thomas") #100%
fuzz.partial_ratio("Batman vs Superman", "Batman") #100%

Effectively, the partial_ratio method could be a fuzzy replacement for a substring containment check (Python’s in operator), just as the regular ratio may replace an equality check.

However, it will fail for strings that are similar, but whose words appear in a different order. Even a slight order change will break it.

#72% with basically the same idea
fuzz.partial_ratio("Peanut Butter and Jelly",
"Jelly and Peanut Butter")

#86% with a random (carefully selected) string
fuzz.partial_ratio("Peanut Butter and Jelly",
"Otter and Hell")

token_sort_ratio

The Token Sort Ratio divides both strings into words, sorts those words alphabetically and joins them back together, before calling the regular ratio on the results.

This means:

fuzz.partial_ratio("Batman vs Superman", 
"Superman vs Batman") #100%

fuzz.partial_ratio("a b c", "c b a") #100%

token_set_ratio

The Token Set Ratio separates each string into words, turns both lists into sets (discarding repeated words) and then sorts those before doing the ratio.

This way, not only does word order stop mattering, but repeated words are accounted for as well.

fuzz.token_set_ratio("fun","fun fun fun") #100%

fuzz.token_set_ratio("Lord the Rings of",
"Lord of the Rings") #100%

Conclusions

Python’s FuzzyWuzzy library can be a very useful tool to have under your belt. Whether you’re matching customers’ names or using it as a poor man’s word embedding, it can save you a lot of trouble and help with feature engineering for your Machine Learning models.

However, it requires preprocessing (like turning both strings to lowercase) and doesn’t take synonyms into account.

Due to this, it may not be the best solution for cases where Natural Language Processing techniques, such as word embeddings with Clustering algorithms, may be needed.

I hope you’ve found this article helpful, and let me know if you find another use for FuzzyWuzzy in your job!

Further Reading

  • For a different Natural Language Processing topic, check out my LSTM (text generation with Deep Learning) tutorial.
  • For a more fun experiment, see my Markov Chains for Text Generation article, where I explain how they work and try them out on the Game of Thrones corpus.
  • If you want to learn about another cool Python framework, read my Scrapy for Web Scraping tutorial, which includes hands-on examples (we crawl Reddit for pictures of kittens!).
  • If you’re interested in Python’s good practices and syntax, with an emphasis on performance, look at my Python List Comprehensions tutorial, where we will learn those things while benchmarking every step.

Follow me on Twitter to stay up to date with more Python tutorials, tips and tricks.

If you want to become a Data Scientist, or grow faster in your Machine Learning career, check out the best data science books guide.

