sklearn stratified train_test_split: "least populated class" error
I have a multilabel dataset (a pd.DataFrame) which looks like this:
This is the value_counts of the flattened tags column:
101 4450171
86 3933972
45 3468383
0 2801217
46 2621773
...
4681 1000
2923 1000
4580 1000
7569 1000
6955 1000
Length: 7657, dtype: int64
Then I use train_test_split from sklearn with the stratify argument to split the dataset with a balanced distribution:
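from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df,
    test_size=0.02,
    stratify=df["tags"].values,  # each row's full tag list is used as its class
)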
And I get this error:
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Why? I see that the least populated class has 1000 samples. Does it actually compare whole lists instead of the values inside them? I based my code on this article: https://keras.io/examples/nlp/multi_label_classification/
Solution 1:[1]
As you said, train_test_split interprets each list of tags as a single label, regardless of what it contains. A sample with tags [1, 2, 3] is not treated as the same class as a sample with tags [1, 2]. Hence, you cannot flatten the tags column to check the label counts.
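To make the difference concrete, here is a minimal sketch on hypothetical toy data (the DataFrame below is invented for illustration and is not the asker's dataset). It compares the per-tag counts of the flattened column with the counts of whole tag combinations, which is what stratification actually sees:

```python
import pandas as pd

# Hypothetical toy data: each row holds a list of tag ids.
df = pd.DataFrame({"tags": [[1, 2, 3], [1, 2], [1, 2], [3], [3]]})

# Flattened view: how often each individual tag appears.
flat_counts = df["tags"].explode().value_counts()

# Combination view: how often each whole list appears.
# Lists are unhashable, so convert them to tuples before counting.
combo_counts = df["tags"].apply(tuple).value_counts()

print(flat_counts.min())   # smallest per-tag count
print(combo_counts.min())  # smallest per-combination count
```

In this toy example every individual tag appears three times, yet the combination (1, 2, 3) appears only once, which is the situation that triggers the error.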
The solution, if you want to keep these labels, is to drop the observations whose label combination is not sufficiently represented (e.g., those with value_counts() == 1). In fact, this is also what they do in the article you linked (see the last code snippet of the "Perform exploratory data analysis" section):
# Filtering the rare terms.
arxiv_data_filtered = arxiv_data.groupby("terms").filter(lambda x: len(x) > 1)
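Applied to a DataFrame whose tags column holds Python lists, the same idea might look like the sketch below. The toy data, the string key, and the variable names are assumptions made for illustration. The key is built because lists are unhashable and cannot be grouped or counted directly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real DataFrame.
df = pd.DataFrame({"tags": [[1, 2]] * 10 + [[3]] * 10 + [[1, 2, 3]]})

# Build a hashable key per row (assumption: tag order matters, as in the question).
key = df["tags"].apply(lambda t: " ".join(map(str, t)))

# Keep only rows whose exact tag combination occurs at least twice.
counts = key.map(key.value_counts())
filtered = df[counts >= 2]
filtered_key = key[counts >= 2]

train_df, test_df = train_test_split(
    filtered,
    test_size=0.2,
    stratify=filtered_key,  # stratify on the combination key, not the raw lists
    random_state=0,
)
```

Note that a stratified split also requires the test set to contain at least as many samples as there are classes, so with many distinct tag combinations a very small test_size such as 0.02 may raise a different ValueError.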
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | user2246849 |