How to fix spaCy Transformers for spaCy version 3.1
I've been trying to replicate the example code from this source: GitHub
I'm using a JupyterLab environment on Linux with spaCy 3.1.
# $ pip install spacy-transformers
# $ python -m spacy download en_trf_bertbaseuncased_lg
import spacy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
# sentence similarity
print(apple1.similarity(apple2)) #0.69861203
print(apple1.similarity(apple3)) #0.5404963
# sentence embeddings
apple1.vector # or apple1.tensor.sum(axis=0)
Since I'm using spaCy 3.1, I changed
python -m spacy download en_trf_bertbaseuncased_lg
to
python -m spacy download en_core_web_trf
and replaced
nlp = spacy.load("en_trf_bertbaseuncased_lg")
with
nlp = spacy.load("en_core_web_trf")
Now the full code looks like this:
import spacy
nlp = spacy.load("en_core_web_trf")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
# sentence similarity
print(apple1.similarity(apple2)) #0.69861203
print(apple1.similarity(apple3)) #0.5404963
# sentence embeddings
apple1.vector # or apple1.tensor.sum(axis=0)
However, when I run the code, my output, instead of being
#0.69861203 #0.5404963
becomes simply
#0.0 #0.0
I also get the following UserWarning:
<ipython-input-30-ed0c29210d4e>:8: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
print(apple1.similarity(apple2)) #0.69861203
<ipython-input-30-ed0c29210d4e>:8: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
print(apple1.similarity(apple2)) #0.69861203
<ipython-input-30-ed0c29210d4e>:9: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
print(apple1.similarity(apple3)) #0.5404963
<ipython-input-30-ed0c29210d4e>:9: UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
print(apple1.similarity(apple3)) #0.5404963
Does anyone know how to fix this code to calculate similarity correctly?
Solution 1:[1]
Doc.similarity uses word vectors to calculate similarity, and transformer models don't include them. You should use en_core_web_lg or another model with word vectors, or use an alternate method like a custom hook or sentence transformers.
For more details, see the documentation on similarity, or this recent discussion.
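To make the custom hook option more concrete, here is a minimal sketch in plain Python, not spaCy's own implementation. It shows a cosine similarity that returns 0.0 for empty or all-zero vectors (which is roughly the situation behind warning W008), and a small factory that builds a function taking two docs and returning a float, which is the shape a similarity hook needs. The names cosine_similarity, make_similarity_hook and get_vector are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is empty or all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Nothing to compare: this is why empty vectors give a similarity of 0.0.
        return 0.0
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (norm_u * norm_v)


def make_similarity_hook(get_vector: Callable[[object], Sequence[float]]):
    """Build a (doc, other) -> float function usable as a similarity hook.

    `get_vector` is a hypothetical callable that maps a doc to a single
    fixed-length vector, for example a pooled transformer output.
    """
    def similarity(doc, other) -> float:
        return cosine_similarity(get_vector(doc), get_vector(other))

    return similarity
```

With spaCy, the returned function could be attached to a document through its user_hooks dictionary, for example apple1.user_hooks["similarity"] = make_similarity_hook(get_vector), after which apple1.similarity(apple2) would call the hook instead of relying on word vectors. How get_vector reads the transformer output (for instance from doc._.trf_data) is an assumption here and depends on the installed spacy-transformers version, so check its documentation before relying on a particular attribute.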
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | polm23 |