Top terms in corpus gensim
I am using the Python package Gensim for clustering. I first created a dictionary by tokenizing and lemmatizing the sentences of the given text, and then used this dictionary to create a corpus with the following code:
from gensim import corpora
mydict = corpora.Dictionary(LemWords)
corpus = [mydict.doc2bow(text) for text in LemWords]
I understand that the corpus contains the id of each word along with its frequency in each document. I want to know the frequency of a given word across the whole corpus so that I can find the top terms. Is there a method that returns the frequency of a term in the entire corpus?
Solution 1:[1]
You can try this:
import itertools
from collections import defaultdict
total_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_count[word_id] += word_count
# Top ten (word_id, count) pairs; mydict[word_id] gives the word itself
sorted(total_count.items(), key=lambda x: x[1], reverse=True)[:10]
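The ranking above is keyed by integer word ids rather than by the words themselves. Here is a minimal sketch of mapping the ids back to tokens. It assumes a toy corpus in the same shape that doc2bow produces and a plain id-to-token dict standing in for the gensim dictionary; toy_corpus and id2token are hypothetical names used only for illustration. With gensim, indexing mydict[word_id] should play the role of id2token here.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-ins: a toy bag-of-words corpus in the same shape that
# doc2bow produces, and an id -> token mapping playing the role of mydict.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(1, 1), (2, 1)],
]
id2token = {0: "cat", 1: "dog", 2: "fish"}

# Sum the per-document counts for every word id.
total_count = Counter()
for word_id, word_count in chain.from_iterable(toy_corpus):
    total_count[word_id] += word_count

# Translate ids to tokens and keep the two most frequent terms.
top_terms = [(id2token[word_id], count)
             for word_id, count in total_count.most_common(2)]
print(top_terms)
```

Using a Counter instead of a defaultdict is only a convenience: most_common does the sorting and slicing in one call.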
Solution 2:[2]
Following your code:
from collections import Counter
from gensim import corpora

mydict = corpora.Dictionary(LemWords)
corpus = [mydict.doc2bow(text) for text in LemWords]
# Word frequency per document, keyed by the word itself (optional)
wordfreq_doc = [{mydict[idw]: freq for idw, freq in cp} for cp in corpus]
# Word frequency over the whole corpus
wordfreq_all = Counter()
for fwd in wordfreq_doc:
    wordfreq_all.update(fwd)
# List of (word, count) pairs, most frequent first
wordfreq_all = wordfreq_all.most_common()
I use both. I concatenate the first (the per-document frequencies) with my dict data frame, which lets me check, for instance, whether LSA is working well. I use the second (the corpus-wide frequencies) to find stop words and to check the text balance.
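One possible way to look for stop-word candidates from these frequencies is to count in how many documents each word appears and flag the words that show up almost everywhere. This is only a sketch under stated assumptions, not necessarily what is done above: toy_wordfreq_doc is a hypothetical stand-in shaped like wordfreq_doc, and the max_doc_share cutoff is an arbitrary illustrative value.

```python
from collections import Counter

# Hypothetical per-document word frequencies, shaped like wordfreq_doc.
toy_wordfreq_doc = [
    {"the": 3, "cat": 1},
    {"the": 2, "dog": 2},
    {"the": 1, "cat": 1, "fish": 4},
]

# Document frequency: the number of documents in which each word appears.
doc_freq = Counter()
for doc in toy_wordfreq_doc:
    doc_freq.update(doc.keys())

# Flag words present in more than max_doc_share of the documents.
max_doc_share = 0.9  # hypothetical cutoff, tune for your own data
n_docs = len(toy_wordfreq_doc)
stop_candidates = sorted(
    word for word, df in doc_freq.items() if df / n_docs > max_doc_share
)
print(stop_candidates)
```

Counting documents rather than raw occurrences keeps one very long document from dominating the result.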
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | KRKirov |
Solution 2 | Eduardo Freitas |