Gensim: How to load corpus from saved lda model?
When I saved my LdaModel with lda_model.save('model'), it saved 4 files:
model
model.expElogbeta.npy
model.id2word
model.state
I want to use pyLDAvis.gensim to visualize the topics, which seems to need the model, corpus and dictionary. I was able to load the model and dictionary with:
lda_model = LdaModel.load('model')
dict = corpora.Dictionary.load('model.id2word')
Is it possible to load the corpus? How?
Solution 1:[1]
Sharing this here because it took me a while to find the answer to this as well. Note that dict is a poor name for the dictionary because it shadows Python's built-in dict type, so we use lda_dict instead.
# text_array is a list of tokenized documents (a list of lists of strings)
# e.g. text_array = [['volume', 'eventually', 'metric', 'rally'], ...]
# lda_dict is a gensim.corpora.Dictionary object, e.g. loaded from 'model.id2word'
bow_corpus = [lda_dict.doc2bow(doc) for doc in text_array]
Solution 2:[2]
The gensim Python code says to ignore the expElogbeta and state files. It may be possible to recover the corpus, which is a list of documents where each document is a list of pairs of two numbers (token id, count), but extracting it would be complicated. I suggest rebuilding the corpus from the original text data using id2word.
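To make the "pairs of numbers" structure concrete, here is a small pure-Python sketch of what a bag-of-words corpus looks like. The token2id mapping and the to_bow helper are hypothetical stand-ins for illustration only; with gensim, the loaded Dictionary and its doc2bow method do this work.

```python
from collections import Counter

# Hypothetical token -> id mapping, standing in for a loaded gensim Dictionary.
token2id = {"metric": 0, "rally": 1, "volume": 2}


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the mapping."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Tokenized documents; "unseen" is not in the mapping and is dropped.
text_array = [
    ["volume", "metric", "volume"],
    ["rally", "unseen"],
]

bow_corpus = [to_bow(doc, token2id) for doc in text_array]
print(bow_corpus)  # [[(0, 1), (2, 2)], [(1, 1)]]
```

The ids only mean something relative to the mapping that produced them, which is why the corpus should be rebuilt with the same id2word that the model was trained with.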
Solution 3:[3]
Jireh answered correctly but it may be confusing how to load all the previous LDA files. I'm not sure why gensim saves the *.state and *.npy files (I'd appreciate insights in the comments). To reuse a previous LDA model you load the *.model and *.id2word files along with your original corpus.
For instance, if your documents are in the 'docs' column of a dataframe, load that dataframe again, since you will need it to recreate your corpus.
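import pandas as pd
from gensim import corpora, models
from pyLDAvis import gensim_models
df = pd.read_csv('your_file.csv')
# doc2bow needs each document as a list of tokens, not a raw string.
# str.split() is only a placeholder: apply the same preprocessing you used for training.
texts = [doc.split() for doc in df['docs']]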
You load your previously created dictionary as follows:
dictionary = corpora.Dictionary.load('your_file.id2word')
... and then create the corpus from the dictionary and your original texts (created from the dataframe['docs'] above):
corpus = [dictionary.doc2bow(text) for text in texts]
The previously created LDA model is loaded via gensim:
lda_model = models.ldamodel.LdaModel.load('your_file.model')
These objects are then fed into your pyLDAvis instance:
lda_viz = gensim_models.prepare(lda_model, corpus, dictionary)
If you don't use the saved .id2word file, the token ids in your corpus may not match the model's vocabulary and you can run into shape errors (IndexError). I've had this happen when I ran LDA multicore, so I load the .id2word file rather than recreating the dictionary from the corpus.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jireh |
Solution 2 | Tuấn Nguyễn Hoàng Thanh |
Solution 3 | script_kitty |