'Algorithm to find the most frequent words [closed]

I can't think of an algorithm for my code. Please help me out.

We have two files. The first one has text (Text length is n-words). The second one contains "word synonyms". That is, in one row will be written the initial form of the word, after its forms. For example

  • house, hours, how, housing

The length of this file is m words. The files are very long.

I need to find the 100 most frequent words in the first file. If we find the word computer and complete and comfy, it will be counted as one word (because they are in the same row in the second file) computer (initial form). But the counter of this word is already equal to three.

Help to find the fastest algorithm for the solution. And what time complexity will be based on n, m.



Solution 1:[1]

  • Parse the synonyms file into a dict that maps the synonyms to their main word.
  • Replace every word by its main synonym, if it has one, or keep it if it doesn't. If syn_dict is the dict of synonyms and word is a word, then syn_dict.get(word, word) will return word's main synonym if it's in the dict, or return word otherwise.
  • Use collections.Counter to count all the words.
  • Use method .most_common of Counter to extract the 100 most frequent words.

Emulating the text files with StringIO for reproducibility:

from io import StringIO
from collections import Counter

textfile = StringIO('''The house is comfy and complete with computers
It has been housing a computer for hours''')
synonymfile = StringIO('''computer, comfy, complete, computers
house, hours, how, housing''')

syn_dict = {}
for line in synonymfile:
    word, *synonyms = map(str.strip, line.split(','))
    for s in synonyms:
        syn_dict[s] = word

print(syn_dict)
# {'comfy': 'computer', 'complete': 'computer', 'computers': 'computer',
#  'hours': 'house', 'how': 'house', 'housing': 'house'}

c = Counter(syn_dict.get(word, word) for line in textfile for word in line.lower().split())

print( c.most_common(100) )
# [('computer', 4), ('house', 3), ('the', 1), ('is', 1),
#  ('and', 1), ('with', 1), ('it', 1), ('has', 1),
#  ('been', 1), ('a', 1), ('for', 1)]

With actual files it should look like this:

from collections import Counter

with open('file2.txt', 'r') as synonymfile:
    syn_dict = {}
    for line in synonymfile:
        word, *synonyms = map(str.strip, line.split(','))
        for s in synonyms:
            syn_dict[s] = word

with open('file1.txt', 'r') as textfile:
    c = Counter(syn_dict.get(word, word) for line in textfile for word in line.lower().split())

print( c.most_common(100) )
# [('computer', 4), ('house', 3), ('the', 1), ('is', 1),
#  ('and', 1), ('with', 1), ('it', 1), ('has', 1),
#  ('been', 1), ('a', 1), ('for', 1)]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Stef