Python closest match between two string columns
I am looking for the closest match between two string columns in two separate tables. I don't think the content matters much. Some values can be matched by pre-processing the data (lowercasing, removing spaces and stop words, etc.) and doing a join, but that gives only about 80 matches out of more than 350. Note that the two tables have different lengths.
I tried some code I found online, but it isn't working:
import sys
from difflib import SequenceMatcher, get_close_matches

def Races_chien(df1, df2):
    myList = []
    total = len(df1)
    possibilities = list(df2['Rasse'])
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1['Race']):
        my_str = ('Progress : ' + str(round((idx1 / total) * 100, 3)) + '%')
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        # get 1 best match that has a ratio of at least 0.7
        best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
        s.set_seq2(df1_str, best_match)
        myList.append([df1_str, best_match, s.ratio()])
    return myList
It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given
How can I make this work?
Solution 1:[1]
Here is the answer I finally arrived at:
from fuzzywuzzy import process

value = []
similarity = []
for i in df1.col:
    # best single match from df2.col; each result starts with (matched value, score)
    ratio = process.extract(i, df2.col, limit=1)
    value.append(ratio[0][0])
    similarity.append(ratio[0][1])
# assign the lists directly so the values line up by position, whatever df1's index is
df1['value'] = value
df1['similarity'] = similarity
For each row of df1, this adds the closest-matching value from df2 together with its similarity score (0 to 100).
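If fuzzywuzzy is not available, roughly the same idea can be sketched with difflib from the standard library. This is a swapped-in technique, not what the answer above uses: difflib's ratio is a different score from the one fuzzywuzzy reports. The helper names `normalize` and `best_match` are hypothetical, and plain lists stand in for the DataFrame columns.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and collapse whitespace; extend with stop-word removal as needed
    return " ".join(text.lower().split())


def best_match(query, choices):
    """Return (original choice, score 0-100) for the choice most similar to query."""
    q = normalize(query)
    scored = [(c, SequenceMatcher(None, q, normalize(c)).ratio()) for c in choices]
    choice, ratio = max(scored, key=lambda pair: pair[1])
    return choice, round(ratio * 100)


value, similarity = best_match("  german   SHEPHERD", ["German Shepherd Dog", "Poodle"])
```

Comparing the normalized strings but returning the original choice keeps the pre-processing from leaking into the result column.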
Solution 2:[2]
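I think you need s.set_seqs(df1_str, best_match) instead of s.set_seq2(df1_str, best_match): set_seq2 only sets the second sequence, so it accepts a single argument (see the difflib docs). Also note that get_close_matches returns a list, so you probably want to compare against best_match[0] after checking that the list is not empty.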
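A minimal sketch of the question's function with that change applied, assuming plain lists of strings in place of the `Race` and `Rasse` columns. The name `closest_matches` and the choice to record `None` with a ratio of 0.0 when nothing clears the cutoff are assumptions made for illustration.

```python
from difflib import SequenceMatcher, get_close_matches


def closest_matches(names, candidates, cutoff=0.7):
    """Return [name, best_match_or_None, ratio] for each entry in names."""
    matcher = SequenceMatcher(isjunk=None, autojunk=False)
    results = []
    for name in names:
        # get_close_matches returns a list (possibly empty), not a string
        found = get_close_matches(name, candidates, n=1, cutoff=cutoff)
        if not found:
            results.append([name, None, 0.0])
            continue
        best = found[0]
        # set_seqs takes both sequences; set_seq1 and set_seq2 take one each
        matcher.set_seqs(name, best)
        results.append([name, best, matcher.ratio()])
    return results


rows = closest_matches(
    ["labrador retriver", "xyz"],
    ["Labrador Retriever", "labrador retriever", "Poodle"],
)
```

With DataFrames, the two arguments could be `df1['Race']` and `list(df2['Rasse'])`, as in the question.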
Solution 3:[3]
You can use the jellyfish library, which has useful tools for measuring how similar two strings are, if that is what you are looking for.
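To illustrate the kind of measure such a library provides, here is a plain-Python sketch of Levenshtein (edit) distance. It is not jellyfish's implementation, and the function names `levenshtein` and `closest` are hypothetical; in practice you would call the library's own function rather than this loop.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest(name, candidates):
    # Smallest edit distance wins; ties go to the earlier candidate
    return min(candidates, key=lambda c: levenshtein(name, c))
```

Unlike a 0 to 1 ratio, an edit distance grows with string length, so a fixed threshold may need scaling by the length of the longer string.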
Solution 4:[4]
Try changing:
s = SequenceMatcher(isjunk=None, autojunk=False)
To:
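s = SequenceMatcher(None, isjunk=None, autojunk=False)
Note that isjunk is already the first positional parameter of SequenceMatcher, so this call passes it twice and raises a TypeError itself. SequenceMatcher(None, autojunk=False) is the valid positional form, and the set_seq2 call still has to be fixed separately.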
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Daniel |
Solution 2 | svfat |
Solution 3 | wogisha |
Solution 4 | Bob |