Get Confidence Probability Scores for Each Predicted Result in CatBoost Classifier

I have built a machine learning model using a CatBoost classifier to predict the category name of each row, as shown in Screenshot 1 below. However, if the input is unknown, or is anything the model was not trained on, I need to return null for it.

My idea is to use the prediction probability as a confidence score, as shown in Screenshot 2 below (Expected Output). For a known input the model should give a high probability, and for an unknown, unseen input it should give a low one.

How can I achieve this and add a probability column to my predicted results, as in Screenshot 2 below (Expected Output)?

Code I am working with

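# Predicted class label for each row
pred = pipe_model_.predict(df_unseen)
# Per-class probabilities, one column per class
predict_proba = pipe_model_.predict_proba(df_unseen)
# Raw model scores (RawFormulaVal) before conversion to probabilities
preds_raw = pipe_model_.predict(df_unseen,
                                prediction_type='RawFormulaVal')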

The output of predict_proba from the code above is below:

enter image description here

Sample Input Trained Dataframe (Screenshot 1)

enter image description here

The expected predicted output is below (Screenshot 2). The yellow-highlighted row is an input the model has never seen or been trained on, so its probability is low, and I can write an if condition to omit it as my requirement demands.

enter image description here



Solution 1:[1]

To summarize your requirements:

  1. Return the probability of the label predicted by the model
  2. If the input (Name) was not part of the training set, null the probability

If this is correct, then for requirement 1 the only step you're missing is the mapping from the .predict_proba() output to the classes. You can read .classes_ to recover that mapping: column i of the probability array corresponds to class i in .classes_. See related answer. With this mapping, you can store the prediction as well as the probabilities for each class, and present only the probability for the class that was predicted.
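The mapping described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual pipeline: the helper name top_class_with_probability, the probability values, and the class labels are all made up for the example, and it assumes the columns of the probability array are in the same order as .classes_.

```python
import numpy as np

def top_class_with_probability(proba, classes):
    """Return (predicted label, probability of that label) for each row.

    proba   : array of shape (n_samples, n_classes), e.g. from predict_proba
    classes : class labels in the same column order, e.g. from classes_
    """
    proba = np.asarray(proba)
    classes = np.asarray(classes)
    best = proba.argmax(axis=1)                      # column index of the most likely class
    labels = classes[best]                           # map column index -> class label
    top_proba = proba[np.arange(len(proba)), best]   # probability of that class only
    return labels, top_proba

# Made-up numbers standing in for pipe_model_.predict_proba(df_unseen)
example_proba = np.array([[0.90, 0.07, 0.03],
                          [0.40, 0.35, 0.25]])
# Hypothetical labels standing in for pipe_model_.classes_
example_classes = np.array(["Logistics", "Finance", "Retail"])

labels, top_proba = top_class_with_probability(example_proba, example_classes)
```

With the real model, the two return values could then be assigned as new columns on the prediction DataFrame, assuming the rows of df_unseen are in the same order as the rows passed to predict_proba.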

For requirement 2, you will need to keep a record of all the inputs (Names) you provided in training. You could keep them in a .txt file, one per line, and load them into a list or set. Then, after predictions are made, you can exclude or null any row whose input was new or unknown.
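A rough sketch of that bookkeeping is below. The function names, the one-name-per-line file layout, and the case-insensitive matching are assumptions made for illustration; adjust the normalization to whatever counts as "the same input" in your data.

```python
def load_known_names(path):
    """Load training-time Names from a text file, one name per line (assumed layout)."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def null_unknown(names, labels, known_names):
    """Keep a predicted label only if its input Name was seen in training."""
    return [
        label if name.strip().lower() in known_names else None
        for name, label in zip(names, labels)
    ]

# In-memory stand-in for load_known_names("training_names.txt") (hypothetical file)
known = {"transit", "transiting"}

names = ["Transit", "Transt"]             # second name was never seen in training
predicted = ["Logistics", "Logistics"]    # made-up model output

filtered = null_unknown(names, predicted, known)
```

A set is used instead of a list so that each membership check stays fast when the training vocabulary is large.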

Requirement 2 is an odd one, though. If you know the Label for each of the Names you have seen before, and you don't want to use the model's output when you haven't seen the Name before, the use case may be better served by a hard-coded lookup from Name to Label. The purpose of a model is to predict the Label when you haven't seen the Name before, after training it on patterns of Names (e.g., if you get the new Name "Transt," the model would hopefully predict "Logistics" after being trained on "Transit" > "Logistics" and "Transiting" > "Logistics").
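For comparison, the hard-coded lookup alternative needs no model at all. The sketch below uses the two example pairs from this answer. The dictionary name, the helper name, and the lowercase normalization are illustrative assumptions.

```python
# Known Name -> Label pairs, e.g. built once from the training data
NAME_TO_LABEL = {
    "transit": "Logistics",
    "transiting": "Logistics",
}

def lookup_label(name):
    """Return the known Label for a Name, or None if the Name was never seen."""
    return NAME_TO_LABEL.get(name.strip().lower())

seen = lookup_label("Transit")    # known Name
unseen = lookup_label("Transt")   # new Name, so the lookup has no answer
```

The lookup returns None for "Transt", which is exactly the case where a trained model would still attempt a prediction.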

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 K. Thorspear