How do I know the order of the classes in a CatBoost classifier's class_weights?

This is a pretty dumb question, but I couldn't find the answer anywhere, so I will take my chances here...

I'm building a classifier using CatBoost. Since this is an NLP problem, my features are the words/tokens in the tweet and the target is the class label. Basically, I have something like this:

tweet                       target
I was looking at her...     happy
It's really hot today       mad
Last Friday night was...    sad
.
.
.

Due to company compliance, I can't share the dataset, but I guess you guys will understand and can even try using another dataframe (this one is very similar: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification). I have 5 classes as target and the dataset is imbalanced. The class proportions are:

happy: 0.80
neutral: 0.11
mad: 0.080
sad: 0.005
confused: 0.005

So, after splitting into training and test, stratified by the target, I was using this pipeline:

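from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from catboost import CatBoostClassifier

pipe = Pipeline([
    # Bag of unigrams and bigrams, with English stop words removed
    ("tokenizer", CountVectorizer(analyzer="word",
                                  ngram_range=(1, 2),
                                  token_pattern=r"\w+",
                                  stop_words="english")),
    # Keep the 90 best features; chi2 is an assumed scoring function
    ("feature_selection", SelectKBest(chi2, k=90)),
    ("clf", CatBoostClassifier()),
])

pipe.fit(X_train, y_train)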

Since the dataset is imbalanced, how can I use class_weights here? I saw a tutorial doing something similar to this:

CatBoostClassifier(class_weights=[1-0.8, 1-0.11, 1-0.08, 1-0.005, 1-0.005])

But how do I know which one is the correct order?

I tried using the name, like class_weights={'happy': 1-0.8...}, but it didn't work as well.



Solution 1:[1]

It would help to know what led you to conclude that setting the class weights as above "didn't work as well". Do you mean the model performed worse, or did you run into an error?

Your second example is how you set class weights when the labels are not consecutive integers (e.g., class_weights={'a': 1.0, 'b': 0.5, 'c': 2.0}). Because each weight is keyed by its label, the order of the classes no longer matters. See the CatBoost documentation on class_weights.
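A minimal sketch of the dictionary approach, assuming the weights are meant to be 1 minus each class's share of the training labels, as in the question. The helper names class_weights_from_labels and make_classifier, and the toy label list, are made up for illustration; in practice you would pass y_train.

```python
from collections import Counter


def class_weights_from_labels(labels):
    """Map each label to 1 - (its share of the labels)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 1 - count / total for label, count in counts.items()}


def make_classifier(weights):
    """Build a CatBoostClassifier with weights keyed by label."""
    # Imported here so the helper above can be used without CatBoost installed.
    from catboost import CatBoostClassifier
    return CatBoostClassifier(class_weights=weights, verbose=0)


# Toy labels standing in for y_train (hypothetical data)
toy_labels = ["happy"] * 8 + ["neutral"] + ["mad"]
weights = class_weights_from_labels(toy_labels)
print(weights)
```

In the pipeline from the question, the last step would then become ("clf", make_classifier(weights)), and no ordering of classes needs to be tracked.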

You could also try auto_class_weights='Balanced', since you are trying to weight the classes inversely to how often they occur. See the CatBoost documentation on auto_class_weights.
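To get a feel for the difference, here is a rough sketch comparing the manual 1 - proportion weights with what 'Balanced' would give. It assumes the formula as I read it in the CatBoost documentation (largest class total divided by each class's total), so treat the numbers as an approximation and check the docs for your version. The function names are hypothetical.

```python
def balanced_weights(proportions):
    """Assumed 'Balanced' rule: largest class share / this class's share."""
    largest = max(proportions.values())
    return {label: largest / share for label, share in proportions.items()}


def make_balanced_classifier():
    """Let CatBoost derive the class weights from the training labels."""
    # Imported here so the comparison above runs without CatBoost installed.
    from catboost import CatBoostClassifier
    return CatBoostClassifier(auto_class_weights="Balanced", verbose=0)


# Class proportions taken from the question
proportions = {"happy": 0.80, "neutral": 0.11, "mad": 0.08,
               "sad": 0.005, "confused": 0.005}

manual = {label: 1 - share for label, share in proportions.items()}
balanced = balanced_weights(proportions)

for label in proportions:
    print(f"{label:>8}: manual={manual[label]:.3f}  balanced={balanced[label]:.2f}")
```

Under that assumption, the rare classes get a far larger weight relative to "happy" than the 1 - proportion scheme gives them, which may or may not be what you want with classes this small.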

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    K. Thorspear