How does a tf-idf model handle unseen words in test data?
I have read many blogs but was not satisfied with the answers. Suppose I train a tf-idf model on a few documents, for example:
" John like horror movie."
" Ryan watches dramatic movies"
(and so on)
I use this code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# twenty_train is assumed to be loaded already
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# Gives the count of each word in each document
print(X_train_counts.todense())
But it doesn't tell me which column corresponds to which word. How do I get the words as column headers in the X_train_counts output, and likewise in X_train_tfidf?
So the X_train_tfidf output will be a matrix of tf-idf scores, something like:
      horror  watch  movie  drama
doc1  score1  ...    ...    ...
doc2  ...     ...    ...    ...
Is this correct?
What does fit do, and what does transform do?
In sklearn it is mentioned that:
fit(..) method to fit our estimator to the data and secondly the transform(..) method to transform our count-matrix to a tf-idf representation.
What does "fit our estimator to the data" mean?
Now suppose a new test document comes in:
" Ron likes thriller movies"
How do I convert this document to tf-idf? We can't convert it, right?
How do I handle the word thriller, which does not appear in the training documents?
Solution 1:[1]
Taking the two texts as input:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
text = ["John like horror movie", "Ryan watches dramatic movies"]
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
# fit_transform learns the vocabulary (and idf weights) and converts the training text
X_train_counts = count_vect.fit_transform(text)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names()
pd.DataFrame(X_train_tfidf.toarray(), columns=count_vect.get_feature_names_out())
Output (every word occurs in only one of the two documents, so the four words in each row share the same weight after L2 normalization):
   dramatic  horror  john  like  movie  movies  ryan  watches
0       0.0     0.5   0.5   0.5    0.5     0.0   0.0      0.0
1       0.5     0.0   0.0   0.0    0.0     0.5   0.5      0.5
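To make the difference between fit and transform concrete, the two steps can be run separately. This is a minimal sketch using the same two sentences: fit only learns state from the training data (the vocabulary for CountVectorizer, one idf weight per column for TfidfTransformer), while transform applies that learned state to produce a matrix. The attribute names vocabulary_ and idf_ are the ones scikit-learn exposes after fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]

# fit: learn the vocabulary (word -> column index) from the training text
count_vect = CountVectorizer()
count_vect.fit(text)
print(count_vect.vocabulary_)

# transform: turn text into a count matrix using the learned vocabulary
X_train_counts = count_vect.transform(text)

# fit: learn one idf weight per vocabulary column from the counts
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(X_train_counts)
print(tfidf_transformer.idf_)

# transform: rescale the counts by the learned idf weights and normalize each row
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
print(X_train_tfidf.shape)
```

fit_transform is simply both steps in one call on the same data, which is why it is used on the training set only.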
To vectorize a new comment, use transform rather than fit_transform. Words that are not in the learned vocabulary are ignored during vectorization.
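new_comment = ["ron don't like dramatic movie"]
# transform only: reuse the vocabulary and idf weights learned from the training text
X_new = tfidf_transformer.transform(count_vect.transform(new_comment))
# get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names()
pd.DataFrame(X_new.toarray(), columns=count_vect.get_feature_names_out())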
   dramatic  horror  john     like    movie  movies  ryan  watches
0   0.57735     0.0   0.0  0.57735  0.57735     0.0   0.0      0.0
If you want to use a fixed vocabulary, prepare a list of the words you want to keep, append new words to it as needed, and pass the list to CountVectorizer:
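vocabulary = ['dramatic', 'movie', 'horror']
# Keep entries lowercase: CountVectorizer lowercases text by default, so 'Thriller' would never match
vocabulary.append('thriller')
count_vect = CountVectorizer(vocabulary=vocabulary)
count_vect.fit_transform(text)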
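As a rough illustration of what the fixed vocabulary buys you, the sketch below fits on the two training sentences with "thriller" added to the vocabulary and then transforms the test sentence from the question. It again uses TfidfVectorizer as a shorthand for the two-step pipeline. Note that "thriller" never occurs in the training text, so its idf weight comes only from the default smoothing (smooth_idf=True), not from real document frequencies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["John like horror movie", "Ryan watches dramatic movies"]
# Lowercase entries, because the input text is lowercased by default
vocabulary = ["dramatic", "movie", "horror", "thriller"]

vectorizer = TfidfVectorizer(vocabulary=vocabulary)
vectorizer.fit(train)

# With a list vocabulary, the columns follow the order of the list
X_test = vectorizer.transform(["Ron likes thriller movies"])
weights = dict(zip(vocabulary, X_test.toarray()[0]))
print(weights)
```

Here "thriller" gets its own column and a nonzero weight, while "movies" is now ignored because only "movie" is in the list. If the new words matter for your model, refitting on a corpus that actually contains them is likely to give more meaningful idf weights.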
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |