spaCy: Which component of the pre-trained en_core_web_md pipeline contains the morphologizer/morphological analysis?

I need this pre-trained pipeline to analyze the morphological features of my text.

To make the analysis more resource-efficient, I want to disable the components I don't need, so I tried to find out which components of this pipeline perform the morphological analysis.

The documentation is unclear on this point. Does anyone know?



Solution 1 [1]:

All components enabled:

import spacy

nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)
> ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

doc = nlp("A man walks into a bar.")
print(doc[2].morph)
> Number=Sing|Person=3|Tense=Pres|VerbForm=Fin

If you disable 'tagger', morph does not return anything, and a warning is printed:

\lib\site-packages\spacy\pipeline\lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
  warnings.warn(Warnings.W108)

This is the main answer to your question: since the _md model does not have an explicit morphologizer component, disabling either tagger or attribute_ruler means that morph no longer works.
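
A minimal sketch of that failure mode (the sentence is just an example; disable is a standard argument of spacy.load in spaCy 3):

import spacy

# Load the model with the tagger disabled.
nlp = spacy.load("en_core_web_md", disable=["tagger"])

# Processing triggers warning W108 from the rule-based lemmatizer,
# because no component has assigned token.pos.
doc = nlp("A man walks into a bar.")

# The morph analysis comes back empty.
print(doc[2].morph)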

What I found interesting, however, is that disabling tok2vec causes the morph info to be less detailed (regardless of whether it is disabled when loading the model or when passing a sentence to it):

doc = nlp("A man walks into a bar.", disable=["tok2vec"])
print(doc[2].morph)
> Number=Sing

So, to answer your question: if you need the (full) morphological data, you can disable everything except tagger, attribute_ruler and tok2vec:

doc = nlp("A man walks into a bar.", disable=['parser', 'lemmatizer', 'ner'])
print(doc[2].morph)
> Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
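
If you know up front that you will never need the other components, you can also exclude them when loading the model instead of disabling them per call. In spaCy 3, exclude (unlike disable) does not even load the excluded components, which saves memory. A sketch, reusing the same example sentence:

import spacy

# exclude skips loading these components entirely;
# disable would load them but switch them off.
nlp = spacy.load("en_core_web_md", exclude=["parser", "lemmatizer", "ner"])

doc = nlp("A man walks into a bar.")
print(doc[2].morph)  # expected: Number=Sing|Person=3|Tense=Pres|VerbForm=Fin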

A general tip if you are looking for efficiency in spaCy: if you are not already doing so, use nlp.pipe(sentence_list) instead of nlp(sentence) to batch-process your sentences. This gives you a noticeable speedup for larger numbers of documents; check out my answer in this thread.
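
A short sketch of that pattern (the sentence list is just a placeholder):

import spacy

nlp = spacy.load("en_core_web_md", exclude=["parser", "lemmatizer", "ner"])

# Placeholder texts; in practice this would be your own corpus.
sentences = ["A man walks into a bar.", "She reads two books a week."]

# nlp.pipe batches the texts internally, which is noticeably faster
# than calling nlp() once per sentence for large inputs.
for doc in nlp.pipe(sentences):
    print([(token.text, str(token.morph)) for token in doc])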

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution source:
[1] Solution 1: ewz93