Gensim: How to load corpus from saved lda model?

When I saved my LdaModel with lda_model.save('model'), it wrote 4 files:

  1. model
  2. model.expElogbeta.npy
  3. model.id2word
  4. model.state

I want to use pyLDAvis.gensim to visualize the topics, which seems to need the model, corpus and dictionary. I was able to load the model and dictionary with:

lda_model = LdaModel.load('model')
dict = corpora.Dictionary.load('model.id2word')

Is it possible to load the corpus? How?



Solution 1:[1]

Sharing this here because it took me a while to find the answer to this as well. Note that dict shadows Python's built-in dict type, so it is a poor variable name; lda_dict is used here instead.

# text_array is a list of lists of tokens from the text you are analysing
# eg. text_array = [['volume', 'eventually', 'metric', 'rally'], ...]
# lda_dict is a gensim.corpora.Dictionary object

bow_corpus = [lda_dict.doc2bow(doc) for doc in text_array]
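If you only have raw strings rather than token lists, they have to be tokenized first, ideally the same way as when the model was trained. A minimal sketch, assuming a simple lowercase regex tokenizer; the tokenize helper and the sample documents are hypothetical, so substitute whatever preprocessing you originally used:

```python
import re

def tokenize(doc):
    # Hypothetical stand-in for the original preprocessing:
    # lowercase the text and keep runs of ASCII letters only.
    return re.findall(r"[a-z]+", doc.lower())

# Made-up raw documents for illustration.
raw_docs = ["Volume eventually rallied.", "Metric volume, metric rally!"]

# One list of tokens per document, the shape doc2bow expects.
text_array = [tokenize(doc) for doc in raw_docs]
print(text_array)
```

The resulting text_array can then be passed through lda_dict.doc2bow as in the line above.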

Solution 2:[2]

The comments in the gensim Python code say to ignore the expElogbeta and state files. A corpus is a list of documents, each of which is a list of pairs of two numbers (token id, count). It may be possible to recover it from the saved files, but that would be complicated, so I suggest rebuilding the corpus from the original text data using id2word.
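To make the "list of pairs" structure concrete, here is a plain-Python sketch of what a bag-of-words corpus looks like. The token2id mapping and the to_bow helper are hypothetical stand-ins for the loaded id2word dictionary and its doc2bow method, not gensim's actual implementation:

```python
from collections import Counter

# Hypothetical token -> id mapping, standing in for the saved id2word dictionary.
token2id = {"metric": 0, "rally": 1, "volume": 2}

def to_bow(tokens):
    # Count the known tokens and return (token_id, count) pairs sorted by id.
    # Tokens missing from the mapping are skipped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Made-up tokenized documents for illustration.
docs = [["volume", "metric", "volume"], ["rally", "unknown"]]
corpus = [to_bow(doc) for doc in docs]
print(corpus)
```

Each inner list describes one document, which is why the corpus has to be rebuilt from the documents themselves rather than from the model files.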

Solution 3:[3]

Jireh answered correctly, but it may be unclear how to load all the files from a previously saved LDA model. I'm not sure why gensim saves the *.state and *.npy files (I'd appreciate insights in the comments). To reuse a previous LDA model, you load the *.model and *.id2word files and rebuild the corpus from your original documents.

For instance, if your documents are in column 'docs' of a dataframe, load that dataframe again, since you will need it to recreate your corpus.

import pandas as pd
from gensim import corpora, models
from gensim.corpora.dictionary import Dictionary
from pyLDAvis import gensim_models

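df = pd.read_csv('your_file.csv')
# doc2bow expects a list of tokens per document, not a raw string.
# Whitespace splitting is a placeholder; tokenize as you did when training the model.
texts = [str(doc).split() for doc in df['docs']]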

You load your previously created dictionary as follows:

dictionary = corpora.Dictionary.load('your_file.id2word')

... and then create the corpus from the dictionary and your original texts (created from the dataframe['docs'] above):

corpus = [dictionary.doc2bow(text) for text in texts]

The previously created LDA model is loaded via gensim:

lda_model = models.LdaModel.load('your_file.model')

These objects are then fed into your pyLDAvis instance:

lda_viz = gensim_models.prepare(lda_model, corpus, dictionary)

If you rebuild the dictionary instead of loading the .id2word file, its token ids may not line up with the model's, and you can run into an IndexError from mismatched shapes. I've had this happen when I ran LDA multicore, so I load the .id2word file rather than recreating the dictionary from the corpus.
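One way to catch that kind of mismatch early is to check that every token id in the corpus fits inside the model's vocabulary before calling prepare. A rough sketch on toy data; the helper name is hypothetical, and in practice you would pass lda_model.num_terms as num_terms:

```python
def find_out_of_range_ids(corpus, num_terms):
    # Collect token ids that fall outside the valid range [0, num_terms).
    bad = set()
    for doc in corpus:
        for token_id, _count in doc:
            if not 0 <= token_id < num_terms:
                bad.add(token_id)
    return sorted(bad)

# Toy corpus: token id 5 does not exist in a 3-term vocabulary.
corpus = [[(0, 1), (2, 2)], [(1, 1), (5, 3)]]
print(find_out_of_range_ids(corpus, num_terms=3))
```

An empty result does not prove that the ids map to the same words as in the model, only that they are within range, so loading the saved .id2word file remains the safer route.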

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jireh
Solution 2 Tu?n Nguy?n Hoàng Thanh
Solution 3 script_kitty