Most efficient string similarity metric function
I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).
I want to compare strings with an average size of 10kb, and I can't take any shortcuts like comparing line-by-line; I need to compare the entire thing. I don't really care what exact metric is used, as long as the results are reasonable and computation is fast. Here's what I've tried so far:
- `difflib.SequenceMatcher` from the standard lib: `ratio()` gives good results, but takes >100ms for 10kb text. `quick_ratio()` takes only half the time, but the results are sometimes far off the real value (see the timing sketch after this list).
- `python-Levenshtein`: Levenshtein distance is an acceptable metric for my use case, but `Levenshtein.ratio('foo', 'bar')` is not faster than `SequenceMatcher`.
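For reference, a small sketch of how these timings can be reproduced; the random 10kb strings are synthetic stand-ins for the real documents, and it assumes the `python-Levenshtein` package is installed:

```python
import difflib
import random
import string
import timeit

import Levenshtein  # python-Levenshtein

# Two synthetic ~10kb strings as placeholders for the real inputs
random.seed(0)
a = "".join(random.choices(string.ascii_lowercase + " \n", k=10_000))
b = "".join(random.choices(string.ascii_lowercase + " \n", k=10_000))

# Fresh SequenceMatcher per run, since it caches results internally
print("ratio:       ", timeit.timeit(
    lambda: difflib.SequenceMatcher(None, a, b).ratio(), number=1))
print("quick_ratio: ", timeit.timeit(
    lambda: difflib.SequenceMatcher(None, a, b).quick_ratio(), number=1))
print("Levenshtein: ", timeit.timeit(
    lambda: Levenshtein.ratio(a, b), number=1))
```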
Before I start benchmarking every lib on PyPI that provides string similarity functions, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10ms (on commodity hardware), if possible.
Solution 1:[1]
edlib seems to be fast enough for my use case.
It's a C++ lib with Python bindings that calculates the Levenshtein distance for texts <100kb in less than 10ms each (on my machine). 10kb texts are done in ~1ms, which is 100x faster than `difflib.SequenceMatcher`.
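A minimal sketch of using edlib this way; `edlib.align` only reports the raw edit distance, so normalizing it into a [0, 1] similarity score by the longer string's length is a convention of this sketch, not something the library provides:

```python
import edlib

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] based on Levenshtein distance."""
    if not a and not b:
        return 1.0
    # edlib.align computes the edit distance in global ("NW") mode by default
    result = edlib.align(a, b)
    distance = result["editDistance"]
    # Normalize by the longer string so identical strings score 1.0
    return 1.0 - distance / max(len(a), len(b))

print(similarity("kitten", "sitting"))  # ~0.571 (edit distance 3, length 7)
```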
Solution 2:[2]
I've had some luck with RapidFuzz. I don't know how it compares to the others, but it was much faster than thefuzz/fuzzywuzzy.
I don't know if it's applicable for your use case, but it's one of the first things you find when you google "fast string similarity python".
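A minimal usage sketch, assuming only the `rapidfuzz` package; `fuzz.ratio` returns a normalized similarity score on a 0-100 scale:

```python
from rapidfuzz import fuzz

a = "The quick brown fox jumps over the lazy dog"
b = "The quick brown fox jumped over the lazy dogs"

# fuzz.ratio returns a normalized similarity score between 0 and 100
score = fuzz.ratio(a, b)
print(f"{score:.1f}")  # close to 100 for these near-identical strings
```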
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | klamann |
| Solution 2 | Danferno |