How to fetch vectors for a word list with Word2Vec?

I want to create a text file that is essentially a dictionary, with each word paired with its vector representation from word2vec. I assume the process is to first train word2vec, then look up each word from my list to find its representation, and then save the results in a new text file?

I'm new to word2vec and I don't know how to go about doing this. I've read from several of the main sites, and several of the questions on Stack, and haven't found a good tutorial yet.



Solution 1:[1]

Direct access via model[word] is deprecated and was slated for removal in Gensim 4.0.0, in order to separate the training from the embeddings. Use model.wv[word] instead.

Using Gensim in Python, once the vocabulary is built and the model is trained, the word counts and sampling information are already mapped in model.wv.vocab, where model is the variable name of your Word2Vec object. (This is the Gensim 3.x attribute; Gensim 4.0 replaced it with model.wv.key_to_index.)

Thus, to create a dictionary object, you may:

# Map every word in the vocabulary to its vector (Gensim 3.x API)
my_dict = {}
for key in model.wv.vocab:
    my_dict[key] = model.wv[key]
    # Or my_dict[key] = model.wv.get_vector(key)
    # Or my_dict[key] = model.wv.word_vec(key, use_norm=False)

Now that you have your dictionary, you can write it to a file by whatever means you like. For example, you can use the pickle library. Alternatively, if you are using Jupyter Notebook, it offers a convenient 'magic command': %store my_dict > filename.txt. Your filename.txt will look like:

{'one': array([-0.06590105,  0.01573388,  0.00682817,  0.53970253, -0.20303348,
   -0.24792041,  0.08682659, -0.45504045,  0.89248925,  0.0655603 ,
   ......
   -0.8175681 ,  0.27659689,  0.22305458,  0.39095637,  0.43375066,
    0.36215973,  0.4040089 , -0.72396156,  0.3385369 , -0.600869  ],
  dtype=float32),
 'two': array([ 0.04694849,  0.13303463, -0.12208422,  0.02010536,  0.05969441,
   -0.04734801, -0.08465996,  0.10344813,  0.03990637,  0.07126121,
    ......
    0.31673026,  0.22282903, -0.18084198, -0.07555179,  0.22873943,
   -0.72985399, -0.05103955, -0.10911274, -0.27275378,  0.01439812],
  dtype=float32),
 'three': array([-0.21048863,  0.4945509 , -0.15050395, -0.29089224, -0.29454648,
    0.3420335 , -0.3419629 ,  0.87303966,  0.21656844, -0.07530259,
    ......
   -0.80034876,  0.02006451,  0.5299498 , -0.6286509 , -0.6182588 ,
   -1.0569025 ,  0.4557548 ,  0.4697938 ,  0.8928275 , -0.7877308 ],
  dtype=float32),
  'four': ......
}

You may also wish to look into the native save / load methods of Gensim's word2vec.
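If you want a plain text file rather than a pickle, a minimal sketch could look like the one below. It assumes my_dict maps each word to a sequence of floats (a NumPy array from model.wv[key] should work, as does a plain list). The function name write_vectors_txt, the file name and the sample values are made up for illustration. The layout (a header line with the word count and dimension, then one word and its values per line) follows the common word2vec text format.

```python
import os
import tempfile


def write_vectors_txt(path, vectors):
    """Write a {word: vector} mapping as text: a 'count dim' header, then one word per line."""
    words = list(vectors)
    dim = len(vectors[words[0]]) if words else 0
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"{len(words)} {dim}\n")
        for word in words:
            values = " ".join(f"{float(x):.6f}" for x in vectors[word])
            handle.write(f"{word} {values}\n")


# Stand-in data (hypothetical values); with Gensim you would pass my_dict instead.
demo_vectors = {"one": [0.5, -0.25, 1.0], "two": [0.0, 0.125, -1.5]}
demo_path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
write_vectors_txt(demo_path, demo_vectors)

# Read the file back to inspect what was written.
with open(demo_path, encoding="utf-8") as handle:
    demo_lines = handle.read().splitlines()
```

Six decimal places is an arbitrary choice here; raise the precision if you need to reload the vectors without losing information.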

Solution 2:[2]

The Gensim tutorial explains this very clearly.

First, create a word2vec model, either by training it on text, e.g.

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)  # Gensim 4.0 renamed size to vector_size

or by loading a pre-trained model (you can find them here, for example).

Then iterate over all your words and check for their vectors in the model:

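for word in words:
  # Use model[word] instead if model is already a KeyedVectors object (e.g. loaded pre-trained vectors)
  vector = model.wv[word]

Then write each word and its vector in whatever format you want.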
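One thing to watch for in that loop: a word that is not in the model's vocabulary (for example, one that fell below min_count) raises a KeyError on lookup. A minimal sketch of a guarded lookup follows. The helper name lookup_vectors is hypothetical, and a plain dict stands in for model.wv so the example runs without a trained model; with Gensim you would pass model.wv, which supports the same "in" test and indexing.

```python
def lookup_vectors(words, keyed_vectors):
    """Split words into found {word: vector} pairs and a list of missing words."""
    found, missing = {}, []
    for word in words:
        if word in keyed_vectors:  # membership test avoids a KeyError
            found[word] = keyed_vectors[word]
        else:
            missing.append(word)
    return found, missing


# Stand-in for model.wv (hypothetical values), so the sketch runs on its own.
fake_wv = {"here": [0.1, 0.2], "is": [0.3, 0.4]}
found, missing = lookup_vectors(["here", "is", "unseen"], fake_wv)
print(sorted(found), missing)
```

Keeping the missing words in a separate list makes it easy to report which entries of your word list have no vector.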

Solution 3:[3]

If you are willing to use Python with the gensim package, then, building upon this answer and the Gensim Word2Vec documentation, you could do something like this:

from gensim.models import Word2Vec

# Take some sample sentences
tokenized_sentences = [["here","is","one"],["and","here","is","another"]]

# Initialise the model; for more information, check the Gensim Word2Vec documentation
# (Gensim 3.x signature; Gensim 4.0 renamed size to vector_size)
model = Word2Vec(tokenized_sentences, size=100, window=2, min_count=0)

# Get the ordered list of words in the vocabulary
words = model.wv.vocab.keys()

# Make a dictionary
we_dict = {word:model.wv[word] for word in words}

Solution 4:[4]

You can get the vectors directly through

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)  # Gensim 4.0 renamed size to vector_size
model.wv.vectors

and words through

model.wv.vocab.keys()  # Gensim 3.x; for words in the row order of model.wv.vectors, use model.wv.index2word

Hope it helps!

Solution 5:[5]

Using basic python:

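# Build a list of single-entry {word: vector} dictionaries (Gensim 3.x API).
# index2word is ordered like the rows of model.wv.vectors; vocab.keys() is not guaranteed to be.
words = model.wv.index2word
all_vectors = []
for index, vector in enumerate(model.wv.vectors):
    all_vectors.append({words[index]: vector})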

Solution 6:[6]

For gensim 4.0:

my_dict = {}

for word in word_list:
    # norm=True returns the L2-normalised (unit-length) vector
    my_dict[word] = model.wv.get_vector(word, norm=True)
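In case the norm=True argument is unclear: it asks for the vector scaled to unit Euclidean length, which is what you usually want when comparing words by cosine similarity. The sketch below only illustrates that scaling in plain Python; it is not Gensim's own code, and the function name l2_normalize is made up for this example.

```python
import math


def l2_normalize(vector):
    """Return the vector scaled to unit Euclidean length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0.0:
        return list(vector)
    return [x / length for x in vector]


# A 3-4-5 triangle makes the result easy to check by hand.
unit = l2_normalize([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))
```

If you want the raw trained values in your text file instead, leave norm at its default.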

Solution 7:[7]

Gensim 4.0 updates: the vocab attribute is deprecated, and the way to look up a word's vector has changed.

Get the ordered list of words in the vocabulary:

words = list(model.wv.index_to_key)

Get the vector for 'also':

print(model.wv['also'])
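Since the answers in this thread mix Gensim 3.x and 4.x attribute names, a small helper that copes with both might be handy. This is only a sketch: the name vocabulary_words is hypothetical, and the SimpleNamespace objects are stand-ins that mimic the two attribute layouts (index_to_key in 4.x, index2word in 3.x) so the example runs without Gensim installed. With a real model you would pass model.wv.

```python
from types import SimpleNamespace


def vocabulary_words(keyed_vectors):
    """Return the vocabulary as a list, ordered like the rows of the vector matrix."""
    if hasattr(keyed_vectors, "index_to_key"):  # Gensim 4.x name
        return list(keyed_vectors.index_to_key)
    return list(keyed_vectors.index2word)  # Gensim 3.x name


# Stand-ins (hypothetical objects) that mimic the two attribute layouts.
wv_new = SimpleNamespace(index_to_key=["also", "here"])
wv_old = SimpleNamespace(index2word=["also", "here"])
print(vocabulary_words(wv_new), vocabulary_words(wv_old))
```

The resulting list can then be used to build the dictionary, e.g. {word: model.wv[word] for word in vocabulary_words(model.wv)}.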

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Nikita Astrakhantsev
Solution 3
Solution 4 Wickkiey
Solution 5 TrickOrTreat
Solution 6 keramat
Solution 7 Homa