How to fix spaCy Transformers for spaCy version 3.1

I'm trying to replicate the following example code from GitHub:

I'm using a JupyterLab environment on Linux with spaCy 3.1.

# $ pip install spacy-transformers
# $ python -m spacy download en_trf_bertbaseuncased_lg

import spacy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")

# sentence similarity
print(apple1.similarity(apple2)) #0.69861203
print(apple1.similarity(apple3)) #0.5404963

# sentence embeddings
apple1.vector  # or apple1.tensor.sum(axis=0)

Since I'm using spaCy 3.1, I changed

python -m spacy download en_trf_bertbaseuncased_lg

to

python -m spacy download en_core_web_trf

and I replaced

nlp = spacy.load("en_trf_bertbaseuncased_lg")

with

nlp = spacy.load("en_core_web_trf")

Now the full code looks like this:

import spacy
nlp = spacy.load("en_core_web_trf")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")

# sentence similarity
print(apple1.similarity(apple2)) #0.69861203
print(apple1.similarity(apple3)) #0.5404963

# sentence embeddings
apple1.vector  # or apple1.tensor.sum(axis=0)

However, when I run the code, my output, instead of being

#0.69861203 #0.5404963

becomes simply

#0.0 #0.0

I also get the following UserWarning:

<ipython-input-30-ed0c29210d4e>:8: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
  print(apple1.similarity(apple2)) #0.69861203
<ipython-input-30-ed0c29210d4e>:8: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
  print(apple1.similarity(apple2)) #0.69861203
<ipython-input-30-ed0c29210d4e>:9: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
  print(apple1.similarity(apple3)) #0.5404963
<ipython-input-30-ed0c29210d4e>:9: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
  print(apple1.similarity(apple3)) #0.5404963

Does anyone know how to fix this code to calculate similarity correctly?



Solution 1:[1]

Doc.similarity uses word vectors to calculate similarity, and transformer models such as en_core_web_trf don't include them. With no vectors to compare, spaCy emits W007 and W008 and returns 0.0. Use en_core_web_lg or another model that ships word vectors, or use an alternative method such as a custom similarity hook or sentence-transformers.
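As a rough sketch of the two options, the snippet below shows (a) loading a pipeline that ships word vectors and (b) registering a custom similarity hook through doc.user_hooks. The helper names cosine, word_vector_similarity and install_similarity_hook are made up for illustration, and vector_of is a hypothetical callable that you would supply to turn a Doc into a single vector. The sketch assumes en_core_web_lg has been installed with python -m spacy download en_core_web_lg.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_vector_similarity(text_a, text_b, model_name="en_core_web_lg"):
    """Option (a): compare two texts with a pipeline that ships word vectors.

    Assumes the model named by `model_name` is already installed.
    """
    import spacy  # imported lazily so the other helpers work without spaCy

    nlp = spacy.load(model_name)
    return nlp(text_a).similarity(nlp(text_b))


def install_similarity_hook(doc, vector_of):
    """Option (b): make doc.similarity(other) compare vectors built by `vector_of`.

    `vector_of` is a hypothetical user-supplied callable mapping a Doc to a
    flat vector; spaCy calls the hook as hook(doc, other).
    """
    doc.user_hooks["similarity"] = lambda a, b: cosine(vector_of(a), vector_of(b))
    return doc
```

For example, word_vector_similarity("Apple shares rose on the news.", "Apple pie is delicious.") should return a nonzero score without the W007/W008 warnings. For en_core_web_trf, the transformer output is stored on doc._.trf_data, so a vector_of for option (b) would have to pool that output into one vector; the exact layout depends on the installed spacy-transformers version, so check it before relying on a particular index.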

For more details, see the documentation on similarity, or this recent discussion.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: polm23