XGBoost Model performance
I am trying to use XGBoost for classification, but I am quite doubtful about its accuracy. I applied it with default parameters and the precision is 100%:
import xgboost as xgb
from sklearn.metrics import precision_score

# Baseline: XGBoost with default parameters
xg_cl_default = xgb.XGBClassifier()
xg_cl_default.fit(trainX, trainY)
preds = xg_cl_default.predict(testX)
precision_score(testY, preds)
# 1.0
However, my data is imbalanced, so I used the scale_pos_weight parameter along with a few other parameters, given below:
# Negative-to-positive class ratio, computed on a PySpark DataFrame
ratio = int(df_final.filter(col('isFraud')==0).count() / df_final.filter(col('isFraud')==1).count())
xg_cl = xgb.XGBClassifier(scale_pos_weight=ratio, n_estimators=50)
eval_set = [(valX, valY.values.ravel())]
xg_cl.fit(trainX, trainY.values.ravel(), eval_metric="error", early_stopping_rounds=10, eval_set=eval_set, verbose=True)
preds = xg_cl.predict(testX)
precision_score(testY, preds)
# 1.0
In both cases my precision is 100% and recall is 99%. This is not acceptable to me, as the data is highly imbalanced.
Solution 1:[1]
For imbalanced datasets, a more appropriate evaluation metric is the area under the precision-recall curve, so set eval_metric="aucpr" instead.
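For illustration, a minimal sketch of that change, reusing the variable names from the question (xg_cl, ratio, trainX, trainY, valX, valY) and assuming an XGBoost version that still accepts eval_metric and early_stopping_rounds in fit() (newer releases expect them in the constructor instead):
# Same fit as in the question, with the evaluation metric swapped to aucpr
xg_cl = xgb.XGBClassifier(scale_pos_weight=ratio, n_estimators=50)
eval_set = [(valX, valY.values.ravel())]
xg_cl.fit(trainX, trainY.values.ravel(), eval_metric="aucpr", early_stopping_rounds=10, eval_set=eval_set, verbose=True)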
Also, you should tune the parameters of XGBoost using cross-validation, again with the area under the precision-recall curve as the evaluation metric; one possible sketch follows below. Cross-validation can be done in a variety of ways, and a quick search should turn up a number of code examples. Based on the code you shared, unless your problem is trivial, it is unlikely that you can get a meaningful model without careful tuning of the parameters.
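One way to do this is with scikit-learn's RandomizedSearchCV, scoring by average_precision (scikit-learn's summary of the precision-recall curve). The search space below is purely illustrative, not a recommendation:
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space; adjust to your problem
param_dist = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [50, 100, 200],
    "subsample": [0.6, 0.8, 1.0],
}
search = RandomizedSearchCV(
    xgb.XGBClassifier(scale_pos_weight=ratio),
    param_distributions=param_dist,
    n_iter=20,
    scoring="average_precision",  # PR-curve-based scorer
    cv=5,
    random_state=42,
)
search.fit(trainX, trainY.values.ravel())
print(search.best_params_, search.best_score_)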
Lastly, you can plot the confusion matrix using Scikit-Learn from the true labels and the predicted labels to get a sense of whether the model is making meaningful predictions.
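For example, with scikit-learn >= 1.0 the built-in display helper can plot it directly from the labels:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are true classes, columns are predicted classes
ConfusionMatrixDisplay.from_predictions(testY, preds)
plt.show()
On a highly imbalanced dataset, the counts in the minority-class row are usually more informative than the overall precision.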
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | lightalchemist |