sklearn stratified train_test_split: the least populated class error

I have a multilabel dataset (a pd.DataFrame) which looks like this:

This is the value_counts of the flattened tags column:

101     4450171
86      3933972
45      3468383
0       2801217
46      2621773
         ...   
4681       1000
2923       1000
4580       1000
7569       1000
6955       1000
Length: 7657, dtype: int64

Then I use train_test_split from sklearn with the stratify argument to split the dataset while keeping the class distribution balanced:

train_df, test_df = train_test_split(
    df,
    test_size=0.02,
    stratify=df["tags"].values,
)

And I get this error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

Why? I can see that the least populated class has 1000 samples. Does it actually compare whole lists instead of the individual list values? I based this on the following article: https://keras.io/examples/nlp/multi_label_classification/



Solution 1:[1]

As you said, train_test_split treats each list of tags as a single label, regardless of what it contains. A sample with tags [1, 2, 3] is not considered the same class as a sample with tags [1, 2]. Hence, flattening the tags column does not tell you the label counts that stratification actually uses.
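To make the difference concrete, here is a minimal sketch on a made-up five-row DataFrame (the data and the names flat_counts, combo_key and combo_counts are illustrative assumptions, not taken from the question). It compares the per-tag counts you get after flattening with the per-combination counts that stratification would work with:

```python
import pandas as pd

# Toy data (hypothetical): every individual tag appears at least 3 times,
# but the combination [1, 2, 3] appears only once.
df = pd.DataFrame({"tags": [[1, 2], [1, 2], [1, 2, 3], [2, 3], [2, 3]]})

# Counts per individual tag, i.e. what flattening the column shows.
flat_counts = df["tags"].explode().value_counts()

# Counts per whole tag list. Lists are unhashable, so each list is
# turned into a string key before counting.
combo_key = df["tags"].map(lambda tags: ",".join(map(str, tags)))
combo_counts = combo_key.value_counts()

print(flat_counts.min())   # smallest per-tag count
print(combo_counts.min())  # smallest per-combination count
```

On this toy data the rarest tag still occurs several times, while the rarest combination occurs once, which is the situation the error message describes.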

The solution, if you want to keep stratifying on these labels, is to drop the observations whose label combination is not represented often enough (e.g., those with value_counts() == 1). In fact, this is also what they do in the article you linked (see the last code snippet of the "Perform exploratory data analysis" section):

# Filtering the rare terms.
arxiv_data_filtered = arxiv_data.groupby("terms").filter(lambda x: len(x) > 1)
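Applied to a DataFrame whose tags column holds actual Python lists, the same idea could look roughly like the sketch below. The toy data, the string key and the names key, counts, keep and filtered are assumptions made for illustration. The key is needed because pandas cannot group or count list values directly (lists are unhashable):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): two frequent tag combinations and one singleton.
df = pd.DataFrame({"tags": [[1, 2]] * 50 + [[2, 3]] * 50 + [[1, 2, 3]]})

# One hashable key per row that represents the whole tag list.
key = df["tags"].map(lambda tags: ",".join(map(str, tags)))

# Keep only rows whose tag combination occurs more than once.
counts = key.value_counts()
keep = key.map(counts) > 1
filtered = df[keep]

# Stratify on the combination key of the remaining rows.
train_df, test_df = train_test_split(
    filtered,
    test_size=0.2,
    stratify=key[keep],
    random_state=0,
)
```

Note that a stratified split also needs the test set to contain at least as many samples as there are distinct classes, so with many distinct tag combinations a very small test_size can trigger a different ValueError.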

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 user2246849