spaCy: Which component of the pre-trained en_core_web_md pipeline contains the morphologizer/morphological analysis?
I need the mentioned pre-trained pipeline to analyze the morphological features of my text.
To make the analysis more resource-efficient, I want to disable the components I don't need, so I tried to find out which components of this pipeline perform the morphological analysis.
The documentation is unclear on this. Does anyone know?
Solution 1:
All components enabled:
import spacy
nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)
> ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
doc = nlp("A man walks into a bar.")
print(doc[2].morph)
> Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
If you disable 'tagger', morph does not return anything and prints a warning:
\lib\site-packages\spacy\pipeline\lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
warnings.warn(Warnings.W108)
This is the main answer to your question: since the _md model does not have an explicit morphologizer component, disabling either tagger or attribute_ruler means that morph doesn't work any more.
However, what I found interesting is that disabling tok2vec makes the morph info less detailed (no matter whether it is disabled when loading the model or when handing a sentence to the model):
doc = nlp("A man walks into a bar.", disable=["tok2vec"])
print(doc[2].morph)
> Number=Sing
So, to answer your question: if you need the (full) morphological data, you can disable everything except tagger, attribute_ruler and tok2vec:
doc = nlp("A man walks into a bar.", disable=['parser', 'lemmatizer', 'ner'])
print(doc[2].morph)
> Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
Also, a general tip if you are looking for efficiency in spaCy: if you are not already doing it, make sure to use nlp.pipe(sentence_list) instead of nlp(sentence) to batch-process your sentences. This gives you a noticeable speedup for larger numbers of documents; check out my answer in this thread.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
| --- | --- |
| Solution 1 | ewz93 |