How to mock spacy models / Doc objects for unit tests?
Loading spacy models slows down running my unit tests. Is there a way to mock spacy models or Doc objects to speed up unit tests?
Example of a current slow test:
import spacy

nlp = spacy.load("en_core_web_sm")

def test_entities():
    text = u"Google is a company."
    doc = nlp(text)
    assert doc.ents[0].text == u"Google"
Based on the docs, my approach is to construct the Vocab and Doc manually and set the entities as tuples:
from spacy.vocab import Vocab
from spacy.tokens import Doc

def test():
    alphanum_words = u"Google Facebook are companies".split(" ")
    labels = [u"ORG"]
    words = alphanum_words + [u"."]
    # Every token is followed by a space except the last word and the final period.
    spaces = len(words) * [True]
    spaces[-1] = False
    spaces[-2] = False
    vocab = Vocab(strings=(alphanum_words + labels))
    doc = Doc(vocab, words=words, spaces=spaces)

    def get_hash(text):
        return vocab.strings[text]

    # Entities as (label_hash, start_token, end_token) tuples.
    entity_tuples = tuple([(get_hash(labels[0]), 0, 1)])
    doc.ents = entity_tuples
    assert doc.ents[0].text == u"Google"
Is there a cleaner, more Pythonic way to mock spaCy objects for unit tests that check entities?
Solution 1:[1]
This is a great question actually! I'd say your instinct is definitely right: If all you need is a Doc object in a given state and with given annotations, always create it manually wherever possible. And unless you're explicitly testing a statistical model, avoid loading it in your unit tests. It makes the tests slow, and it introduces too much unnecessary variance. This is also very much in line with the philosophy of unit testing: you want to be writing independent tests for one thing at a time (not one thing plus a bunch of third-party library code plus a statistical model).
Some general tips and ideas:
- If possible, always construct a Doc manually. Avoid loading models or Language subclasses.
- Unless your application or test specifically needs the doc.text, you do not have to set the spaces. In fact, I leave this out in about 80% of the tests I write, because it really only becomes relevant when you're putting the tokens back together.
- If you need to create a lot of Doc objects in your test suite, you could consider using a utility function, similar to the get_doc helper we use in the spaCy test suite. (That function also shows you how the individual annotations are set manually, in case you need it.)
- Use (session-scoped) fixtures for the shared objects, like the Vocab. Depending on what you're testing, you might want to explicitly use the English vocab. In the spaCy test suite, we do this by setting up an en_vocab fixture in the conftest.py.
- Instead of setting the doc.ents to a list of tuples, you can also make it a list of Span objects. This looks a bit more straightforward, is easier to read, and in spaCy v2.1+, you can also pass a string as a label:
from spacy.tokens import Doc, Span

def test_entities(en_vocab):
    doc = Doc(en_vocab, words=["Hello", "world"])
    doc.ents = [Span(doc, 0, 1, label="ORG")]
    assert doc.ents[0].text == "Hello"
- If you do need to test a model (e.g. in the test suite that makes sure that your custom models load and run as expected) or a language class like English, put them in a session-scoped fixture. This means that they'll only be loaded once per session instead of once per test. Language classes are lazy-loaded and may also take some time to load, depending on the data they contain. So you only want to do this once.
# Note: You probably don't have to do any of this, unless you're testing your
# own custom models or language classes.
import pytest
import spacy

@pytest.fixture(scope="session")
def en_core_web_sm():
    return spacy.load("en_core_web_sm")

@pytest.fixture(scope="session")
def en_lang_class():
    lang_cls = spacy.util.get_lang_class("en")
    return lang_cls()

def test(en_lang_class):
    doc = en_lang_class("Hello world")
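To make the fixture and helper tips above a bit more concrete, here is a minimal sketch of how the two could fit together. Note that make_doc is a hypothetical name and a simplified stand-in, not the get_doc helper from the spaCy test suite, and the en_vocab fixture here is only assumed to resemble the one in spaCy's conftest.py. It builds the vocab from the English language class without loading any statistical model.

```python
import pytest
from spacy.lang.en import English
from spacy.tokens import Doc, Span


@pytest.fixture(scope="session")
def en_vocab():
    # Built once per test session; no statistical model is loaded here.
    return English().vocab


def make_doc(vocab, words, ents=()):
    """Hypothetical helper: build a Doc from words plus (start, end, label) triples."""
    doc = Doc(vocab, words=words)
    # Passing the label as a string assumes spaCy v2.1+.
    doc.ents = [Span(doc, start, end, label=label) for start, end, label in ents]
    return doc


def test_first_entity(en_vocab):
    words = ["Google", "is", "a", "company", "."]
    doc = make_doc(en_vocab, words, ents=[(0, 1, "ORG")])
    assert doc.ents[0].text == "Google"
    assert doc.ents[0].label_ == "ORG"
```

In a real suite the fixture would normally live in conftest.py so every test module can request it, and the helper could grow extra keyword arguments (tags, heads and so on) as your tests need them.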
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | pypae |