How to automatically choose the value of num_features_to_select that gives the best result in select_features from CatBoostClassifier?
I'm writing a class in Python in which I'm trying to automatically pick the value of num_features_to_select in CatBoostClassifier().select_features(). Right now the function simply enumerates candidate num_features_to_select values.
Code:
def CatBoost(X_var=df.drop(columns=['status']), y_var=df[['creation_date', 'status']]):
    from catboost import CatBoostClassifier, Pool, EShapCalcType, EFeaturesSelectionAlgorithm
    from sklearn.model_selection import train_test_split
    from datetime import datetime, timedelta  # datetime utilities for working with dates
    import numpy as np
    import os
    os.environ['OPENBLAS_NUM_THREADS'] = '10'

    # hold out the last 7 days as a validation period
    valid_time_border = X_var['creation_date'].max() - timedelta(days=7)
    X_train, X_test, y_train, y_test = train_test_split(
        X_var[X_var['creation_date'] <= valid_time_border].drop(columns=['creation_date']),
        y_var[y_var['creation_date'] <= valid_time_border]['status'],
        test_size=0.3)
    X_valid = X_var[X_var['creation_date'] > valid_time_border].drop(columns=['creation_date'])
    y_valid = y_var[y_var['creation_date'] > valid_time_border]['status']

    best_accuracy = 0
    mas_num_features_to_select = [10, 20, 30, 40, 50, 60]
    for i in mas_num_features_to_select:
        # all candidate predictors
        predict_columns = X_train.columns.to_list()
        # indices of the categorical features
        cat_features_num = np.where(np.isin(X_train.dtypes, ['bool', 'object']))[0]
        train_pool = Pool(X_train, y_train, cat_features=cat_features_num, feature_names=list(predict_columns))
        test_pool = Pool(X_test, y_test, cat_features=cat_features_num, feature_names=list(predict_columns))
        model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)
        summary = model.select_features(
            train_pool,
            eval_set=test_pool,
            features_for_select=predict_columns,
            num_features_to_select=i,
            steps=15,
            algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
            shap_calc_type=EShapCalcType.Regular,
            train_final_model=False,
            logging_level='Silent',
            plot=False
        )
        predict_columns = summary['selected_features_names']
        # refit on the selected subset and score it on the hold-out period
        cat_features_selected = [c for c in predict_columns if str(X_train[c].dtype) in ('bool', 'object')]
        model.fit(X_train[predict_columns], y_train, cat_features=cat_features_selected)
        y_pred = model.predict(X_valid[predict_columns])  # predict the hold-out data
        mislabel = np.sum(y_valid != y_pred)  # count the mislabelled predictions
        accuracy = 1 - mislabel / len(y_pred)
        print(accuracy)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_predict_columns = predict_columns
    print('Best prediction accuracy: ' + str(best_accuracy))
    print('Best features:')
    print(best_predict_columns)
    return best_predict_columns
I can't find any information about methods that would let me do this with the built-in automatic feature selection. Is it even possible using CatBoost?
Solution 1:[1]
If I understand your question correctly, you're looking for a way of using select_features to determine how many and which features to include in the model such that performance is maintained or improved while eliminating the maximum number of features. Sadly, your approach seems to be the best for an automated function. CatBoost does not return the features from the iteration with the best performance, only the features remaining after pruning down to the number specified in num_features_to_select by iterating steps number of times.
If you can compromise and add a manual step, you can set plot=True and see at which number of features the loss value is minimized, as in the example plot in CatBoost's documentation. If you set steps to the number of features, features will be removed one by one, and you can see the loss for the removal of each feature. You could then manually select the number of features to match that iteration. It would be nice if CatBoost had a "train_best_model" parameter instead of just a "train_final_model" parameter! I don't know if there's a way to capture what this function logs to stdout or outputs in the plot, but that output contains the loss values and would allow you to set the value.
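As a rough sketch of reading that loss information programmatically: in the CatBoost versions I've looked at, the summary dict returned by select_features also seems to contain a 'loss_graph' entry with the loss after each elimination step; treat those key names as an assumption and check them in your version. Reusing train_pool, test_pool and predict_columns from the question:

    # Sketch: eliminate one feature per step and pick the step with the lowest
    # eval loss. The 'loss_graph' keys below are an assumption about the summary
    # structure -- verify them against your CatBoost version.
    import numpy as np

    model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)
    summary = model.select_features(
        train_pool,
        eval_set=test_pool,
        features_for_select=predict_columns,
        num_features_to_select=1,           # prune almost everything...
        steps=len(predict_columns) - 1,     # ...removing one feature per step
        algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
        shap_calc_type=EShapCalcType.Regular,
        train_final_model=False,
        logging_level='Silent',
    )

    loss_graph = summary['loss_graph']              # assumed key
    removed = loss_graph['removed_features_count']  # features removed at each point
    losses = loss_graph['loss_values']              # eval loss at each point
    best_step = int(np.argmin(losses))
    best_num_features = len(predict_columns) - removed[best_step]
    print(f'Lowest loss with {best_num_features} features (loss={losses[best_step]:.5f})')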
Edit: I thought of one more approach that is still a form of iterating over the num_features_to_select parameter, but may be interesting.
- Set train_final_model=True, steps=1, and num_features_to_select to the width of your dataset
- Iteratively subtract 1 from num_features_to_select
- At the end of each loop, test the performance of the model
- Stop if the negative performance change exceeds a threshold (e.g., -5% or -2%)
This may take a while, depending on how long the training takes, but it would automatically pick num_features_to_select as you desire; a rough sketch of this loop is below.
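A minimal sketch of that loop, assuming the train_pool, test_pool, X_train, X_valid and y_valid built in the question's function. The relative 5% stopping threshold and the use of accuracy_score are my own illustrative choices, not part of the approach above:

    # Sketch of the iterative approach: remove one more feature per outer
    # iteration and stop once validation accuracy drops too far below the
    # full-feature baseline.
    import numpy as np
    from catboost import CatBoostClassifier, EShapCalcType, EFeaturesSelectionAlgorithm
    from sklearn.metrics import accuracy_score

    all_features = X_train.columns.to_list()

    # baseline: a model trained on all features (equivalent to the first,
    # no-op step where num_features_to_select equals the dataset width)
    baseline_model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)
    baseline_model.fit(train_pool, eval_set=test_pool, verbose=False)
    baseline_accuracy = accuracy_score(y_valid, baseline_model.predict(X_valid))

    max_drop = 0.05          # stop after a >5% relative drop (arbitrary choice)
    best_features = all_features

    for n in range(len(all_features) - 1, 0, -1):
        model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)
        summary = model.select_features(
            train_pool,
            eval_set=test_pool,
            features_for_select=all_features,
            num_features_to_select=n,
            steps=1,
            algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
            shap_calc_type=EShapCalcType.Regular,
            train_final_model=True,   # final model is trained on the selected subset
            logging_level='Silent',
        )
        # in my experience the final model simply ignores the eliminated columns,
        # so the full validation frame can be passed, as in the question's code
        acc = accuracy_score(y_valid, model.predict(X_valid))
        print(n, acc)

        if acc >= baseline_accuracy * (1 - max_drop):
            best_features = summary['selected_features_names']
        else:
            break                     # performance degraded past the threshold

    print(f'Kept {len(best_features)} features: {best_features}')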
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |