How to automatically choose the value of num_features_to_select that gives the best result in select_features from CatBoostClassifier?

I'm writing a class in Python in which I'm trying to automatically pick the value of num_features_to_select for CatBoostClassifier().select_features(). Right now, the function enumerates a fixed list of num_features_to_select values.

Code:

def CatBoost(X_var=df.drop(columns=['status']), y_var=df[['creation_date','status']]):
    import os
    from datetime import timedelta  # used to compute the validation cutoff date

    import numpy as np
    from catboost import CatBoostClassifier, Pool, EShapCalcType, EFeaturesSelectionAlgorithm
    from sklearn.model_selection import train_test_split

    os.environ['OPENBLAS_NUM_THREADS'] = '10'

    # The most recent 7 days are held out as the validation set
    valid_time_border = X_var['creation_date'].max() - timedelta(days=7)

    X_train, X_test, y_train, y_test = train_test_split(
        X_var[X_var['creation_date'] <= valid_time_border].drop(columns=['creation_date']),
        y_var[y_var['creation_date'] <= valid_time_border]['status'],
        test_size=0.3)

    X_valid = X_var[X_var['creation_date'] > valid_time_border].drop(columns=['creation_date'])
    y_valid = y_var[y_var['creation_date'] > valid_time_border]['status']
    best_accuracy = 0
    best_predict_columns = None

    mas_num_features_to_select = [10, 20, 30, 40, 50, 60]

    for i in mas_num_features_to_select:
        # Start from all columns
        predict_columns = X_train.columns.to_list()
        # Identify the categorical columns (bool or object dtype)
        cat_features_num = np.where(np.isin(X_train.dtypes, ['bool', 'object']))[0]

        train_pool = Pool(X_train, y_train, cat_features=cat_features_num, feature_names=list(predict_columns))
        test_pool = Pool(X_test, y_test, cat_features=cat_features_num, feature_names=list(predict_columns))

        model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)

        summary = model.select_features(
            train_pool,
            eval_set=test_pool,
            features_for_select=predict_columns,
            num_features_to_select=i,
            steps=15,
            algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
            shap_calc_type=EShapCalcType.Regular,
            train_final_model=False,
            logging_level='Silent',
            plot=False
        )

        # Refit on the selected features only, so the score reflects the selection
        predict_columns = summary['selected_features_names']
        cat_selected = X_train[predict_columns].select_dtypes(include=['bool', 'object']).columns.to_list()
        model.fit(X_train[predict_columns], y_train, cat_features=cat_selected, verbose=False)
        y_pred = model.predict(X_valid[predict_columns])  # predict on the held-out data
        # Share of correctly labeled validation rows
        accuracy = np.mean(np.ravel(y_pred) == y_valid.to_numpy())
        print(accuracy)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_predict_columns = predict_columns

    print('Best prediction accuracy: ' + str(best_accuracy))
    print('Best features:')
    print(best_predict_columns)
    return best_predict_columns

I can't find any information about a built-in way to choose the number of features automatically. Is this even possible with CatBoost?



Solution 1:[1]

If I understand your question correctly, you're looking for a way of using select_features to determine how many and which features to include in the model such that performance is maintained/improved while eliminating the maximum number of features. Sadly, your approach seems to be the best for an automated function. CatBoost does not return the features from the iteration with the best performance, only the features remaining after pruning down to the number of features specified in num_features_to_select by iterating steps number of times.

If you can compromise and add a manual step, you can set plot=True and see at which number of features the loss value is minimized, as in the example from CatBoost's documentation: [figure: graph showing RMSE of model iterations for which features have been recursively removed using the select_features function]

If you set steps to the number of features, features will be removed one by one, and you can see the loss after the removal of each feature. You could then manually set num_features_to_select to match the iteration with the lowest loss. It would be nice if CatBoost had a "train_best_model" parameter instead of just a "train_final_model" parameter! I don't know if there's a way to capture what this function logs to stdout or outputs in the plot, but that output contains the loss values and would allow you to set the value programmatically.
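One possibility worth checking: as far as I can tell, recent CatBoost versions include the plotted data in the dictionary returned by select_features, under a 'loss_graph' key with 'removed_features_count' and 'loss_values' entries. The sketch below assumes that layout, assumes lower loss is better, and assumes 'eliminated_features_names' is listed in order of elimination; verify all three on your version. The helper name best_feature_subset and the toy summary are made up for illustration.

```python
def best_feature_subset(summary, all_features):
    """Return (kept_features, loss) for the point with the lowest loss.

    Assumed layout of `summary` (check against your CatBoost version):
      summary['loss_graph']['removed_features_count'] -> list of ints
      summary['loss_graph']['loss_values']            -> list of floats
      summary['eliminated_features_names']            -> names, in removal order
    """
    graph = summary['loss_graph']
    counts = graph['removed_features_count']
    losses = graph['loss_values']

    # Lowest loss wins; on ties, prefer the point with more features removed.
    best = min(range(len(losses)), key=lambda k: (losses[k], -counts[k]))
    removed = set(summary['eliminated_features_names'][:counts[best]])

    kept = [name for name in all_features if name not in removed]
    return kept, losses[best]


# Toy summary with invented numbers, shaped like the assumed layout.
toy_summary = {
    'loss_graph': {
        'removed_features_count': [0, 1, 2, 3],
        'loss_values': [0.50, 0.47, 0.45, 0.52],
    },
    'eliminated_features_names': ['d', 'b', 'a'],
}
kept, loss = best_feature_subset(toy_summary, ['a', 'b', 'c', 'd'])
print(kept, loss)  # ['a', 'c'] 0.45
```

With steps equal to the number of features, len(kept) would then be a candidate for num_features_to_select, or kept could be used directly as the feature list.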

Edit: I thought of one more approach that is still a form of iterating over the num_features_to_select parameter, but may be interesting.

  1. Set train_final_model=True, steps=1, and num_features_to_select to the width of your dataset
  2. Iteratively subtract 1 from num_features_to_select
  3. At the end of each loop, test the performance of the model
  4. Stop if negative performance change exceeds a threshold (e.g., -5% or -2%)

This may take a while, depending on how long the training takes, but would automatically pick the num_features_to_select as you desire.
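The stopping rule in the steps above can be sketched independently of CatBoost. Here score_with_n_features is a hypothetical callable that you would write yourself: it would run select_features with train_final_model=True, steps=1, and the given num_features_to_select, then return a validation score where higher is better. Comparing against the best score seen so far (rather than the previous score) is my own choice, and the rule assumes positive scores.

```python
def pick_num_features(score_with_n_features, n_total, max_drop=0.02, n_min=1):
    """Lower the feature count until the score drops past a threshold.

    score_with_n_features: hypothetical callable, n -> validation score
        (higher is better).
    max_drop: largest tolerated relative drop from the best score so far,
        e.g. 0.02 for 2%.
    """
    best_score = score_with_n_features(n_total)
    chosen = n_total
    for n in range(n_total - 1, n_min - 1, -1):
        score = score_with_n_features(n)
        if score < best_score * (1 - max_drop):
            break  # drop exceeds the threshold: keep the previous count
        chosen = n
        best_score = max(best_score, score)
    return chosen


# Toy scores with invented numbers, standing in for real model evaluations.
toy_scores = {5: 0.90, 4: 0.91, 3: 0.905, 2: 0.80, 1: 0.50}
print(pick_num_features(toy_scores.get, n_total=5))  # 3
```

Each call to the scoring function retrains a model, so the loop costs roughly one training run per feature removed.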

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1