The best and simplest way to convert labeled text classification data to spaCy v3 format
Let's suppose we have labeled data for text classification in a nice CSV file with two columns, "text" and "label". I am struggling to understand the spaCy v3 documentation. If I understand correctly, the main source of examples for spaCy v3 is these projects (https://github.com/explosion/projects/tree/v3/tutorials).
However, the training data there are already prepared in the expected nested JSON format.
If I want to perform custom text classification in spaCy v3, I need to convert my data to that structure, e.g. like here (https://github.com/explosion/projects/blob/v3/tutorials/textcat_docs_issues/assets/docs_issues_eval.jsonl).
How do I get from a pandas data frame to that format? Does Prodigy support converting labeled data to the spaCy format? Here is a small example of the dataset:
import pandas as pd

df = pd.DataFrame({
    "TEXT": [
        "i really like this post",
        "thanks for that comment",
        "i enjoy this friendly forum",
        "this is a bad post",
        "i dislike this article",
        "this is not well written",
        "who came up with this stupid idea?",
        "This is just completely wrong!!",
        "Get out of here now!!!!"],
    "LABEL": [
        "POS", "POS", "POS", "NEG", "NEG", "NEG", "RUDE", "RUDE", "RUDE"
    ]
})
Solution 1:[1]
In spaCy v3, training data is basically what you want your output to look like. So for text classification, you create a doc with your text and then set doc.cats for your categories. After you've trained a model, it will do the same thing for new docs.
Whether your data is in a dataframe or not is irrelevant. You just need to iterate over the underlying values.
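For the data frame in the question, getting at the underlying values could look like the sketch below. It assumes the column names TEXT and LABEL from the question's example and uses a shortened copy of that data.

```python
import pandas as pd

# Shortened copy of the question's example data (column names taken from the question).
df = pd.DataFrame({
    "TEXT": ["i really like this post", "this is a bad post", "Get out of here now!!!!"],
    "LABEL": ["POS", "NEG", "RUDE"],
})

# Plain Python lists, aligned by row position.
texts = df["TEXT"].tolist()
labels = df["LABEL"].tolist()

for text, label in zip(texts, labels):
    print(label, "->", text)
```

From here on the data frame is no longer needed; everything else works on the two lists.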
You can do something like this.
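import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a blank pipeline is enough here; only the tokenizer is used
texts = [...]   # list of texts
labels = [...]  # aligned list of labels
db = DocBin()
for text, label in zip(texts, labels):
    doc = nlp(text)
    doc.cats[label] = True
    db.add(doc)
db.to_disk("train.spacy")  # save the docs as a DocBin (.spacy file); example path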
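Filled in end to end, the loop above might look like the following sketch. The blank English pipeline, the helper name make_docbin, and the output path train.spacy are assumptions for illustration and are not taken from the linked project. The sketch also sets every label on every doc (1.0 for the assigned label, 0.0 for the others), which is a common convention when the categories are mutually exclusive.

```python
import spacy
from spacy.tokens import DocBin


def make_docbin(texts, labels, nlp):
    """Hypothetical helper: build a DocBin with doc.cats set for each text."""
    all_labels = sorted(set(labels))
    db = DocBin()
    for text, label in zip(texts, labels):
        doc = nlp.make_doc(text)  # tokenize only, no other components needed
        # One entry per known label: 1.0 for the assigned label, 0.0 otherwise.
        doc.cats = {name: float(name == label) for name in all_labels}
        db.add(doc)
    return db


# A few rows from the question's example data.
texts = ["i really like this post", "this is a bad post", "Get out of here now!!!!"]
labels = ["POS", "NEG", "RUDE"]

nlp = spacy.blank("en")  # assumption: English text, no pretrained model required
db = make_docbin(texts, labels, nlp)
db.to_disk("train.spacy")  # example output path
```

The resulting .spacy file is what the training config expects as its corpus, for example by passing it as the value of --paths.train on the spacy train command line. A second file built the same way from held-out rows can serve as the dev set.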
The project you linked to has a similar script here.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | polm23 |