Do I need to train on my own data to use a BERT model as an embedding vector?

When I try the Hugging Face model, I run the following code:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)

And it gives this warning message:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

My purpose is to find a pretrained model to create embedding vectors for my text, so that they can be used in downstream tasks. I don't want to train my own pretrained model to generate the embedding vectors. In this case, can I ignore those warning messages, or do I need to continue training on my own data? In another post I learned that "Most of the official models don't have pretrained output layers. The weights are randomly initialized. You need to train them for your task." My understanding is that I don't need to train anything if I just want generic embedding vectors for my text from a public model such as the ones on Hugging Face. Is that right?

I am new to transformers, so please comment.



Solution 1:[1]

Indeed the bert-base-uncased model is already pre-trained and will produce contextualised outputs, which should not be random.

If you're aiming to get a vector representation for the entire input sequence, this is typically done by running your sequence through your model (as you have done) and extracting the representation of the [CLS] token.

The position of the [CLS] token may change depending on the base model you are using, but for BERT it is the first token in the output sequence.
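For example, reusing the model and inputs from the question, a minimal sketch of pulling the [CLS] vector directly from the model output could look like this (last_hidden_state is the hidden-state tensor returned by BertModel):

import torch

# model and inputs as loaded in the question
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# the [CLS] token sits at position 0 of the sequence dimension
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape (1, 768) for bert-base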

The FeatureExtractionPipeline in transformers is a wrapper around the process of extracting contextualised features from the model.

from transformers import FeatureExtractionPipeline

# reuse the model and tokenizer loaded in the question
nlp = FeatureExtractionPipeline(
    model=model,
    tokenizer=tokenizer,
)

sentence = "Hello world!"
outputs = nlp(sentence)        # nested list: one entry per input sequence
embeddings = outputs[0]        # per-token embeddings for the sentence
cls_embedding = embeddings[0]  # the [CLS] token embedding

A few things to help verify everything is going as expected (see the sketch after this list):

  • Check that the [CLS] embedding has the expected dimensionality
  • Check that the [CLS] embedding produces similar vectors for similar text, and different vectors for different text (e.g. by applying cosine similarity)
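A rough sketch of both checks, reusing the nlp pipeline from above (the embed helper and the example sentences are only illustrative assumptions):

import torch
import torch.nn.functional as F

def embed(text):
    # hypothetical helper: run the pipeline and return the [CLS] vector as a tensor
    return torch.tensor(nlp(text)[0][0])

# dimensionality check: bert-base has a hidden size of 768
assert embed("Hello world!").shape[0] == 768

# similar sentences should score higher than unrelated ones
similar = F.cosine_similarity(embed("The cat sat on the mat."),
                              embed("A cat was sitting on the rug."), dim=0)
different = F.cosine_similarity(embed("The cat sat on the mat."),
                                embed("Stock markets fell sharply today."), dim=0)
print(similar.item(), different.item())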

Additional References: https://github.com/huggingface/transformers/issues/1950

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Glorfindel