Word Count Matrix of Document Corpus with Pandas DataFrame
Well, I have a corpus of 2000+ text documents and I'm trying to build a matrix with a pandas DataFrame in the most elegant way. The matrix would look like this:
df = pd.DataFrame(index=['Doc1_name', 'Doc2_name', 'Doc3_name', '...', 'Doc2000_name'],
                  columns=['word1', 'word2', 'word3', '...', 'word50956'])
df.iloc[:, :] = 'count_word'  # each cell: how often that word occurs in that document
print(df)
I already have all the documents as full text in a list called "texts". I don't know if my question is clear enough.
Solution 1:[1]
Use sklearn's CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'texts': ["This is one text (the first one)",
                             "This is the second text",
                             "And, finally, a third text"]})
cv = CountVectorizer()
cv.fit(df['texts'])
results = cv.transform(df['texts'])
print(results.shape)  # Sparse matrix, (3, 10)
If the corpus is small enough to fit in your memory (and 2000+ documents is small enough), you can convert the sparse matrix into a pandas DataFrame as follows:
features = cv.get_feature_names()  # on scikit-learn >= 1.0, use cv.get_feature_names_out()
df_res = pd.DataFrame(results.toarray(), columns=features)
df_res
df_res is the result you want:

   and  finally  first  is  one  second  text  the  third  this
0    0        0      1   1    2       0     1    1      0     1
1    0        0      0   1    0       1     1    1      0     1
2    1        1      0   0    0       0     1    0      1     0
In case you get a MemoryError, you can reduce the vocabulary of words to consider using different parameters of CountVectorizer (a sketch combining them follows this list):

- Set the parameter stop_words='english' to ignore English stopwords (like "the" and "and").
- Use min_df and max_df, which make CountVectorizer ignore some words based on document frequency (too frequent or too infrequent words, which may be useless).
- Use max_features to keep only the n most common words.
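Here is a minimal sketch of how those parameters can be combined; the exact thresholds (min_df=2, max_df=0.95, max_features=10000) are illustrative assumptions, not recommended values, and texts is the question's list of documents:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(
    stop_words='english',  # drop common English stopwords
    min_df=2,              # ignore words appearing in fewer than 2 documents
    max_df=0.95,           # ignore words appearing in more than 95% of documents
    max_features=10000,    # keep only the 10,000 most frequent remaining words
)
results = cv.fit_transform(texts)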
Solution 2:[2]
For any not-small corpus of text I would strongly recommend using scikit-learn's CountVectorizer.
It's as simple as:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus) # list of documents (as strings)
It doesn't exactly give you the dataframe in your desired structure, but it shouldn't be hard to construct it using the vocabulary_ attribute of count_vectorizer, which contains the mapping of each term to its index in the result matrix.
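A minimal sketch of that construction, assuming corpus is the list of document strings and doc_names is a matching list of document names (both names are hypothetical here):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus)  # corpus: hypothetical list of documents

# vocabulary_ maps each term to its column index; sort terms by index to order the columns
columns = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)
df = pd.DataFrame(word_counts.toarray(), index=doc_names, columns=columns)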
Solution 3:[3]
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
doc_term_matrix = count_vect.fit_transform(df['texts'])  # df['texts'] holds the documents

# convert the sparse doc-term matrix to a dense array
df_vector = pd.DataFrame(doc_term_matrix.toarray())
df_vector.columns = count_vect.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
df_vector.head()
Solution 4:[4]
import pandas as pd

def create_doc_term_matrix(text, vectorizer):
    # fit the vectorizer on the documents and return a dense document-term DataFrame
    doc_term_matrix = vectorizer.fit_transform(text)
    return pd.DataFrame(doc_term_matrix.toarray(),
                        columns=vectorizer.get_feature_names())
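For completeness, a usage sketch of the function above (the sample texts mirror the question's setup; any scikit-learn vectorizer can be passed in):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["This is one text (the first one)",
         "This is the second text",
         "And, finally, a third text"]

dtm = create_doc_term_matrix(texts, CountVectorizer())
print(dtm)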
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | andersource |
| Solution 3 | DSBLR |
| Solution 4 | Nara Ramezani |