
thefuzz

Python ★ 3.6k updated 1y ago

Fuzzy String Matching in Python

TheFuzz is a Python library for fuzzy string matching, which means finding strings that are similar but not identical. It uses an algorithm called Levenshtein distance, which counts the minimum number of character insertions, deletions, or substitutions needed to transform one string into another. The closer two strings are by that measure, the higher the similarity score, expressed as a number between 0 and 100.
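To make the edit-distance idea concrete, here is a minimal pure-Python sketch of Levenshtein distance and one possible way to turn it into a 0 to 100 score. The function names `levenshtein` and `similarity` are illustrative, and the normalization by the longer string's length is an assumption made for this example; it is not the formula the library itself uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    # Keep b as the shorter string so each row of the table stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Illustrative 0-100 score: assumes normalizing by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))   # 57
```

Identical strings score 100 under this scheme, and every additional edit pulls the score down.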

The library provides several comparison modes for different situations. A simple ratio scores two strings over their full length, so any extra or missing characters lower the score. A partial ratio scores the best-matching substring instead, which is useful when one string is contained in the other, like matching a short search term against a longer title. Token sort ratio sorts the words in each string before comparing, so "New York Giants" and "Giants New York" score as identical. Token set ratio compares the sets of words, so a name that appears twice in one string does not lower the score.

Beyond comparing two strings directly, the library includes a process module for searching a list of choices. You give it a query string and a list of candidates, and it returns the best matches along with their scores. This is useful for autocomplete, deduplication, or correcting typos in user input. The README shows an example of matching a partial team name against NFL team names and getting ranked results back.

Installation is via pip. The library requires Python 3.8 or newer and depends on rapidfuzz, a fast fuzzy matching library, for the underlying calculations. It originated at SeatGeek, a ticket marketplace, where matching inconsistently spelled venue and team names is a common data problem.