Flexible word count in PyTorch Embedding

The Embedding class in PyTorch takes a num_embeddings parameter. According to the documentation, num_embeddings is the "size of the dictionary of embeddings". I am curious about the following two cases when creating an embedding object:

  1. The num_embeddings, or word count in the dataset, is unknown before we create the embedding.
  2. The num_embeddings, or word count, needs to be flexible. For example, I initially create an embedding with num_embeddings = 1000. Later, new elements are added: say I get 10 new words on top of the existing 1000. How can I modify the existing embedding (keeping embedding_dim the same) to accommodate the change?


Solution 1:[1]

I don't think you can change num_embeddings after initializing the embedding. However, you can concatenate new rows onto the existing weight matrix to form a new weight tensor that extends the vocabulary.

I hope this example helps:

import torch
import torch.nn as nn

# indices in the input must be in the interval [0, 4], otherwise an error is thrown
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

# append one randomly initialized row to the weight matrix;
# input indices can now be in the interval [0, 5]
embedding.weight = nn.Parameter(torch.cat((embedding.weight.detach(), torch.randn(1, 3)), dim=0))
embedding.num_embeddings += 1  # keep the module's metadata consistent

input = torch.LongTensor([[1, 5], [4, 3]])
output = embedding(input)  # shape: (2, 2, 3)

Similarly, you can add more than one new vocabulary entry by changing n in torch.randn(n, 3).
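To make that concrete for the scenario in the question (1000 existing words plus 10 new ones), here is a minimal sketch. It assumes the 10 new rows should start from random initialization while the original 1000 rows keep their trained values; the helper name extend_embedding is just for illustration, not a PyTorch API.

import torch
import torch.nn as nn

def extend_embedding(old_emb: nn.Embedding, num_new: int) -> nn.Embedding:
    """Return a new nn.Embedding with num_new extra rows, preserving the old weights."""
    old_num, dim = old_emb.num_embeddings, old_emb.embedding_dim
    new_emb = nn.Embedding(old_num + num_new, dim)
    with torch.no_grad():
        # copy the trained rows; the remaining num_new rows keep
        # the default random initialization of the new module
        new_emb.weight[:old_num] = old_emb.weight
    return new_emb

embedding = nn.Embedding(num_embeddings=1000, embedding_dim=3)
embedding = extend_embedding(embedding, 10)   # now accepts indices 0..1009
print(embedding.weight.shape)                 # torch.Size([1010, 3])

Note that if you do this in the middle of training, you also need to hand the new weight parameter to your optimizer, since the old parameter object is replaced.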

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 user18842383