Top terms in a gensim corpus

I am using the Python package Gensim for clustering. I first built a dictionary by tokenizing and lemmatizing the sentences of the given text, and then used that dictionary to create a corpus with the following code:

 mydict = corpora.Dictionary(LemWords)
 corpus = [mydict.doc2bow(text) for text in LemWords]

I understand that the corpus contains the id of each word along with its frequency in each document. I want to know the frequency of a given word across the whole corpus, so that I can find the top terms. Is there a method that returns the frequency of a term in the entire corpus?



Solution 1:[1]

You can sum the per-document counts for each word id yourself:

import itertools
from collections import defaultdict

total_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_count[word_id] += word_count

# Top ten (word_id, count) pairs, most frequent first
sorted(total_count.items(), key=lambda x: x[1], reverse=True)[:10]
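
The result above is keyed by word id, not by the word itself. As a minimal sketch of turning the ids back into readable terms, the snippet below uses a plain dict named `id2word` and a hand-written `toy_corpus`. Both are hypothetical stand-ins so the example runs without gensim. With a real dictionary the lookup would be `mydict[word_id]`, as Solution 2 does.

```python
from collections import Counter

# Hypothetical stand-in for mydict: maps token id -> token
id2word = {0: "cat", 1: "dog", 2: "fish"}

# Hypothetical bag-of-words corpus, in the same shape doc2bow produces:
# one list of (word_id, count) pairs per document
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(1, 1), (2, 1)],
]

# Sum the counts of each word id over all documents
total_count = Counter()
for doc in toy_corpus:
    for word_id, word_count in doc:
        total_count[word_id] += word_count

# Translate the two most frequent ids back into tokens
top_terms = [(id2word[word_id], count)
             for word_id, count in total_count.most_common(2)]
print(top_terms)
```

`Counter.most_common(n)` does the sort-and-slice step in one call, so it can replace the `sorted(...)[:10]` expression if you prefer.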

Solution 2:[2]

Following your code:

 from collections import Counter
 from gensim import corpora

 mydict = corpora.Dictionary(LemWords)
 corpus = [mydict.doc2bow(text) for text in LemWords]

 # word frequency per document, keyed by the word itself
 wordfreq_doc = [{mydict[idw]: freq for idw, freq in cp}
                 for cp in corpus]

 # word frequency for the whole corpus
 wordfreq_all = Counter()
 for fwd in wordfreq_doc:
     wordfreq_all.update(fwd)
 # list of (word, count) pairs, most frequent first
 wordfreq_all = wordfreq_all.most_common()

I use both. I concatenate the first (per-document frequencies) with my dict data frame, which lets me check, for instance, whether LSA is working well. I use the second (corpus-wide frequencies) to find stop words and to check the text balance.
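
One possible way to use per-document frequencies for spotting stop words is to count in how many documents each word appears and flag the words that occur in nearly all of them. This is only a sketch of that idea, not necessarily what the answer's author does. The `wordfreq_doc` data below is made up and mirrors the shape built above, and the `threshold` value is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical per-document word counts, same shape as wordfreq_doc above
wordfreq_doc = [
    {"the": 3, "cat": 1},
    {"the": 2, "dog": 2},
    {"the": 1, "cat": 1, "fish": 4},
]

# Document frequency: number of documents in which each word appears
doc_count = Counter()
for doc in wordfreq_doc:
    doc_count.update(doc.keys())

n_docs = len(wordfreq_doc)
threshold = 0.9  # assumed cut-off: word occurs in at least 90% of documents

# Words present in almost every document are stop-word candidates
stop_candidates = sorted(word for word, n in doc_count.items()
                         if n / n_docs >= threshold)
print(stop_candidates)
```

Raw corpus-wide counts can be dominated by a single long document, so looking at document frequency alongside them gives a second view of which words are uninformative.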

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 KRKirov
Solution 2 Eduardo Freitas