Most efficient string similarity metric function

I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).

I want to compare strings that average 10kb in size, and I can't take any shortcuts like comparing line by line; I need to compare the entire thing. I don't really care which exact metric is used, as long as the results are reasonable and computation is fast. Here's what I've tried so far:

  • difflib.SequenceMatcher from the standard lib. ratio() gives good results, but takes >100ms for 10kb text. quick_ratio() takes only half the time, but the results are sometimes far off from the real value.
  • python-Levenshtein: Levenshtein distance is an acceptable metric for my use case, but Levenshtein.ratio('foo', 'bar') is no faster than SequenceMatcher (a timing sketch of both attempts follows this list).
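
For reference, a minimal sketch of how these two baselines could be timed on a ~10kb payload. The sample text and the way it is generated are placeholders, not from the question; only SequenceMatcher.ratio() and Levenshtein.ratio() are the calls actually discussed above.

    import difflib
    import time

    import Levenshtein  # provided by the python-Levenshtein package

    # Placeholder ~10kb strings; real inputs would be the documents to compare.
    a = ("lorem ipsum dolor sit amet " * 400)[:10000]
    b = ("lorem ipsum dolor sit amet " * 400)[:10000].replace("dolor", "color")

    start = time.perf_counter()
    score = difflib.SequenceMatcher(None, a, b).ratio()
    print("SequenceMatcher.ratio:", score, time.perf_counter() - start, "s")

    start = time.perf_counter()
    score = Levenshtein.ratio(a, b)
    print("Levenshtein.ratio:    ", score, time.perf_counter() - start, "s")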

Before I start benchmarking every lib on pypi that provides functions for measuring string similarity, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10ms (on commodity hardware), if possible.



Solution 1:[1]

edlib seems to be fast enough for my use case.

It's a C++ lib with Python bindings that calculates the Levenshtein distance for texts <100kb in less than 10ms each (on my machine). 10kb texts are done in ~1ms, which is 100x faster than difflib.SequenceMatcher.
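A minimal sketch of turning edlib's edit distance into a similarity score. The answer only mentions computing the distance; the normalization by the longer string's length is my own convention here, not part of the original solution.

    import edlib

    def similarity(a: str, b: str) -> float:
        # Global (Needleman-Wunsch) alignment; task="distance" skips the
        # alignment path and only computes the edit distance, which is fastest.
        result = edlib.align(a, b, mode="NW", task="distance")
        distance = result["editDistance"]
        # Normalize to a 0..1 similarity; dividing by the longer string's
        # length is one common convention, not the only one.
        longest = max(len(a), len(b)) or 1
        return 1.0 - distance / longest

    print(similarity("kitten", "sitting"))  # ~0.57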

Solution 2:[2]

I've had some luck with RapidFuzz. I don't know how it compares to the others, but it was much faster than thefuzz/fuzzywuzzy.

I don't know if it's applicable to your use case, but it's one of the first things you find when you google "fast string similarity python".
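
A minimal sketch of using RapidFuzz for a whole-string comparison. The choice of scorers below is an assumption on my part; the answer itself doesn't specify which RapidFuzz function was used. fuzz.ratio returns a score scaled 0..100, while the Levenshtein helper returns a normalized 0..1 similarity.

    from rapidfuzz import fuzz
    from rapidfuzz.distance import Levenshtein

    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumped over a lazy dog"

    # General-purpose scorer: indel-based similarity scaled to 0..100.
    print(fuzz.ratio(a, b))

    # Normalized Levenshtein similarity scaled to 0..1, closer to the
    # edit-distance metric discussed above.
    print(Levenshtein.normalized_similarity(a, b))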

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: klamann
Solution 2: Danferno