'How to fetch vectors for a word list with Word2Vec?
I want to create a text file that is essentially a dictionary, with each word being paired with its vector representation through word2vec. I'm assuming the process would be to first train word2vec and then look-up each word from my list and find its representation (and then save it in a new text file)?
I'm new to word2vec and I don't know how to go about doing this. I've read from several of the main sites, and several of the questions on Stack, and haven't found a good tutorial yet.
Solution 1:[1]
The direct access model[word]
is deprecated and will be removed in Gensim 4.0.0 in order to separate the training and the embedding. The command should be replaced with, simply, model.wv[word]
.
Using Gensim in Python, after vocabs are built and the model trained, you can find the word count and sampling information already mapped in model.wv.vocab
, where model
is the variable name of your Word2Vec
object.
Thus, to create a dictionary object, you may:
my_dict = dict({})
for idx, key in enumerate(model.wv.vocab):
my_dict[key] = model.wv[key]
# Or my_dict[key] = model.wv.get_vector(key)
# Or my_dict[key] = model.wv.word_vec(key, use_norm=False)
Now that you have your dictionary, you can write it to a file with whatever means you like. For example, you can use the pickle library. Alternatively, if you are using Jupyter Notebook, they have a convenient 'magic command' %store my_dict > filename.txt
. Your filename.txt will look like:
{'one': array([-0.06590105, 0.01573388, 0.00682817, 0.53970253, -0.20303348,
-0.24792041, 0.08682659, -0.45504045, 0.89248925, 0.0655603 ,
......
-0.8175681 , 0.27659689, 0.22305458, 0.39095637, 0.43375066,
0.36215973, 0.4040089 , -0.72396156, 0.3385369 , -0.600869 ],
dtype=float32),
'two': array([ 0.04694849, 0.13303463, -0.12208422, 0.02010536, 0.05969441,
-0.04734801, -0.08465996, 0.10344813, 0.03990637, 0.07126121,
......
0.31673026, 0.22282903, -0.18084198, -0.07555179, 0.22873943,
-0.72985399, -0.05103955, -0.10911274, -0.27275378, 0.01439812],
dtype=float32),
'three': array([-0.21048863, 0.4945509 , -0.15050395, -0.29089224, -0.29454648,
0.3420335 , -0.3419629 , 0.87303966, 0.21656844, -0.07530259,
......
-0.80034876, 0.02006451, 0.5299498 , -0.6286509 , -0.6182588 ,
-1.0569025 , 0.4557548 , 0.4697938 , 0.8928275 , -0.7877308 ],
dtype=float32),
'four': ......
}
You may also wish to look into the native save / load methods of Gensim's word2vec.
Solution 2:[2]
Gensim tutorial explains it very clearly.
First, you should create word2vec model - either by training it on text, e.g.
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
or by loading pre-trained model (you can find them here, for example).
Then iterate over all your words and check for their vectors in the model:
for word in words:
vector = model[word]
Having that, just write word and vector formatted as you want.
Solution 3:[3]
If you are willing to use python
with gensim
package, then building upon this answer and Gensim Word2Vec Documentation you could do something like this
from gensim.models import Word2Vec
# Take some sample sentences
tokenized_sentences = [["here","is","one"],["and","here","is","another"]]
# Initialise model, for more information, please check the Gensim Word2vec documentation
model = Word2Vec(tokenized_sentences, size=100, window=2, min_count=0)
# Get the ordered list of words in the vocabulary
words = model.wv.vocab.keys()
# Make a dictionary
we_dict = {word:model.wv[word] for word in words}
Solution 4:[4]
You can Directly get the vectors through
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.wv.vectors
and words through
model.wv.vocab.keys()
Hope it helps !
Solution 5:[5]
Using basic python:
all_vectors = []
for index, vector in enumerate(model.wv.vectors):
vector_object = {}
vector_object[list(model.wv.vocab.keys())[index]] = vector
all_vectors.append(vector_object)
Solution 6:[6]
For gensim 4.0:
my_dict = dict({})
for word in word_list:
my_dict[word] = model.wv.get_vector('0', norm = True)
Solution 7:[7]
Gensim 4.0 updates: vocab method is depreciated and change in how to parse a word's vector
Get the ordered list of words in the vocabulary
words = list(w for w in model.wv.index_to_key)
Get the vector for 'also'
print(model.wv['also'])
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Nikita Astrakhantsev |
Solution 3 | |
Solution 4 | Wickkiey |
Solution 5 | TrickOrTreat |
Solution 6 | keramat |
Solution 7 | Homa |