Finding most similar string (formed by two or more words) in a text in Python
Let's say I have the string st="red-winged cormorant"
and the following text:
text="""I have in the past assisted teams at Milford Point and Lighthouse Point.
The closest I've come to an independent Big Sit came many years ago when I teamed up with Luke Tiller,
who wanted to try a circle at Pleasure Beach in Bridgeport, CT.
We had high hopes for the location, which was a narrow barrier beach between marsh and Long Island Sound,
but despite excellent migration weather the count was a dud.
I hadn't given a new circle much thought until October 16, 2019. I was scanning the Great Island
marsh in Old Lyme from the observation deck in hopes of adding Red Cormorant. """
My objective is to find the string in the text that is most similar to st (here, "red cormorant") with the difflib library.
Since the function get_close_matches only allows comparing a string with a list of strings, I have to split the text:
(Note: I make sure to remove all numbers, punctuation and line breaks)
import string
# split() already handles line breaks; deleting "\n" first would fuse words across lines
text=text.translate(str.maketrans("","",string.punctuation+string.digits)).lower().split()
If I were looking for the word "cormorant", I'd use the following code:
import difflib
match=difflib.get_close_matches("cormorant",text, cutoff=0.8)
But in my case the input is "red-winged cormorant" or, if I split it, ["red-winged", "cormorant"]. Thus, the output should be ["red", "cormorant"], since it is the most similar combination of strings and the order matters.
I know that using regular expressions could be a good solution. However, I'd like to know whether difflib could work in this case.
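One way difflib might handle a multi-word input is to compare it against every run of consecutive words in the text instead of against single words. The sketch below is only an illustration under that assumption: closest_phrase is a hypothetical helper (not part of difflib), the window size is taken from the number of words in the input, and the cutoff of 0.6 is simply the default of get_close_matches.

```python
import difflib
import string


def closest_phrase(query, text, cutoff=0.6):
    """Return the run of consecutive words in `text` most similar to `query`.

    Hypothetical helper for illustration; it is not part of difflib.
    """
    # Strip punctuation and digits, lowercase, then split on whitespace
    # (split() also takes care of line breaks).
    table = str.maketrans("", "", string.punctuation + string.digits)
    words = text.translate(table).lower().split()

    # Use as many words per window as the query has, so word order is kept.
    n = len(query.split())
    windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    # Clean the query the same way so both sides are comparable.
    cleaned_query = query.translate(table).lower()
    matches = difflib.get_close_matches(cleaned_query, windows, n=1, cutoff=cutoff)
    return matches[0].split() if matches else []


sample = "I was scanning the Great Island marsh in hopes of adding Red Cormorant."
print(closest_phrase("red-winged cormorant", sample))  # ['red', 'cormorant']
```

Because each window is joined back into a single string, get_close_matches scores whole phrases, so the order of the words matters. A fixed window size will miss matches that span a different number of words than the input; trying several window sizes would be one possible extension.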
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow