The best and simplest way to convert labeled text classification data to spaCy v3 format

Let's suppose we have labeled data for text classification in a nice CSV file with two columns, "text" and "label". I am struggling to understand the spaCy v3 documentation. If I understand correctly, the main source of examples for spaCy v3 is this projects repository (https://github.com/explosion/projects/tree/v3/tutorials).

However, the training data there are already prepared in the expected nested JSON format.

If I want to perform custom text classification in spaCy v3, I need to convert my data to that example structure, e.g. like here (https://github.com/explosion/projects/blob/v3/tutorials/textcat_docs_issues/assets/docs_issues_eval.jsonl).

How do I get from a pandas DataFrame to that format? Does Prodigy support converting labeled data to the spaCy format? Here is a small example of the dataset:

import pandas as pd

df = pd.DataFrame({
    "TEXT": [
        "i really like this post",
        "thanks for that comment",
        "i enjoy this friendly forum",
        "this is a bad post",
        "i dislike this article",
        "this is not well written",
        "who came up with this stupid idea?",
        "This is just completely wrong!!",
        "Get out of here now!!!!",
    ],
    "LABEL": [
        "POS", "POS", "POS", "NEG", "NEG", "NEG", "RUDE", "RUDE", "RUDE",
    ],
})


Solution 1:[1]

In spaCy v3, training data is basically what you want your output to look like. So for text classification, you create a Doc with your text and then set doc.cats for your categories. After you've trained a model, it will do the same thing for new docs.

Whether your data is in a dataframe or not is irrelevant. You just need to iterate over the underlying values.
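As a minimal sketch of that iteration, here is how two aligned lists could be pulled straight from CSV data with the standard library, no dataframe needed. The helper name `read_columns` and the inline `CSV_DATA` string are made up for illustration; the column names are taken from the example in the question.

```python
import csv
import io

# Stand-in for a CSV file on disk; column names follow the question's example.
CSV_DATA = """TEXT,LABEL
i really like this post,POS
this is a bad post,NEG
Get out of here now!!!!,RUDE
"""


def read_columns(handle, text_col="TEXT", label_col="LABEL"):
    """Return two aligned lists (texts, labels) from an open CSV handle."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        texts.append(row[text_col])
        labels.append(row[label_col])
    return texts, labels


texts, labels = read_columns(io.StringIO(CSV_DATA))
for text, label in zip(texts, labels):
    print(label, text)
```

With a real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.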

You can do something like this.

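import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a blank pipeline is enough here; "en" assumes English text
texts = ...   # list of texts, e.g. the TEXT column
labels = ...  # aligned list of labels, e.g. the LABEL column
all_labels = sorted(set(labels))

db = DocBin()
for text, label in zip(texts, labels):
    doc = nlp(text)
    # mark the gold label as 1.0 and every other known label as 0.0
    doc.cats = {name: 1.0 if name == label else 0.0 for name in all_labels}
    db.add(doc)

# save the docs as a DocBin (.spacy file); the file name is just an example
db.to_disk("train.spacy")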

The project you linked to has a similar script here.
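If you want the intermediate JSONL layout rather than going straight to a DocBin, something along these lines might work. It is a sketch under assumptions: the helper names `to_cats` and `to_jsonl` are hypothetical, and the record shape (a "text" key plus a "cats" dict with a 1.0 or 0.0 score per label) is my reading of the linked file, so check it against the actual asset before relying on it.

```python
import json


def to_cats(label, all_labels):
    """Map one gold label to a dict with a 1.0/0.0 entry for every known label."""
    return {name: 1.0 if name == label else 0.0 for name in all_labels}


def to_jsonl(texts, labels):
    """Return one JSON string per example, each with "text" and "cats" keys."""
    all_labels = sorted(set(labels))
    return [
        json.dumps({"text": text, "cats": to_cats(label, all_labels)})
        for text, label in zip(texts, labels)
    ]


# A few rows from the question's example data.
texts = ["i really like this post", "this is a bad post", "Get out of here now!!!!"]
labels = ["POS", "NEG", "RUDE"]

lines = to_jsonl(texts, labels)
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a file gives you one record per line, which is what a JSONL reader expects.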

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 polm23