SHAP: XGBoost and LightGBM difference in shap_values calculation

I have this code in Visual Studio Code:

import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import xgboost as xgb 
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, cross_val_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, accuracy_score

df = pd.read_csv("./mydataset.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=22)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=22)

xgb_model = xgb.XGBClassifier(eval_metric='mlogloss', use_label_encoder=False)
xgb_fitted = xgb_model.fit(X_train, y_train)

explainer = shap.TreeExplainer(xgb_fitted)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values[1], X_test)
shap.summary_plot(shap_values[1], X_test, plot_type="bar")

when I run this code, I am getting this error:

Summary plots need a matrix of shap_values, not a vector.

on the shap.summary_plot line.

What is the problem and how can I solve it?

The above code is based on this code sample: https://github.com/slundberg/shap.

The dataset looks as follows:

Cat1,Cat2,Age,Cat3,Cat4,target
0,0,18,1,0,1
0,0,17,1,0,1
0,0,15,1,1,1
0,0,15,1,0,1
0,0,16,1,0,1
0,1,16,1,1,1
0,1,16,1,1,1
0,0,17,1,0,1
0,1,15,1,1,1
0,1,15,1,0,1
0,0,15,1,0,1
0,0,15,1,0,1
0,1,15,1,1,1
0,1,15,1,0,1
0,1,15,1,0,1
0,0,16,1,0,1
0,0,16,1,0,1
0,0,16,1,0,1
0,1,17,1,0,0
0,1,16,1,1,1
0,1,15,1,0,1
0,1,15,1,0,1
0,1,16,1,1,1
0,1,16,1,1,1
0,0,15,0,0,1
0,0,16,1,0,1
0,1,15,1,0,1

Please note that the actual data has 700 rows; I copied a small portion of it here just to show what the data looks like.

Edit 1

The main reason for this question is to understand how the code should be changed when using different classifiers.

I originally had sample code using LightGBM which worked, but when I changed it to XGBoost, it generated an error on the summary plot.

To show what I mean, I developed the following sample code:

import pandas as pd
import shap
import lightgbm as lgb
import xgboost as xgb 
from sklearn.model_selection import train_test_split

df = pd.read_csv("./mydataset.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=22)

# select one of the two models
model = xgb.XGBClassifier()
#model = lgb.LGBMClassifier()
model_fitted = model.fit(X_train, y_train)

explainer = shap.Explainer(model_fitted)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values[1], X_test)
shap.summary_plot(shap_values[1], X_test, plot_type="bar")

If I use the LightGBM model, it works well; if I use XGBoost, it fails. What is the difference, and how should I change the code so that XGBoost behaves like LightGBM and the application works?



Solution 1:[1]

Assuming you have copied the sample data from the question above to the clipboard, the following will do:

import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    cross_validate,
    cross_val_score,
)
from sklearn.metrics import (
    classification_report,
    ConfusionMatrixDisplay,
    accuracy_score,
)

df = pd.read_clipboard(sep=",")

target = df.pop("target")
X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)


xgb_model = xgb.XGBClassifier(eval_metric="mlogloss", use_label_encoder=False)
xgb_fitted = xgb_model.fit(X_train, y_train)

explainer = shap.TreeExplainer(xgb_fitted)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test)
# shap.summary_plot(shap_values, X_test, plot_type="bar")


The code you pasted assumes shap_values is a list of two (mirror-image) arrays of SHAP values, one for class "0" and one for class "1". The way explainer.shap_values computes SHAP values for XGBoost has changed since that example was written: it now returns a single matrix, so passing shap_values directly (without a class index) is enough.
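If you want a single script that copes with both return types, a small guard works. This is only a sketch that assumes a binary classifier and continues from the variables defined above:

if isinstance(shap_values, list):
    # older SHAP / LightGBM-style output: one matrix per class
    vals = shap_values[1]  # contributions towards class 1
else:
    # newer XGBoost-style output: already a single (n_samples, n_features) matrix
    vals = shap_values

shap.summary_plot(vals, X_test)
shap.summary_plot(vals, X_test, plot_type="bar")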

Solution 2:[2]

Notice that with summary_plot() you want to visualize which features are, in general, most important to the model, so it requires a matrix:

For single output explanations this is a matrix of SHAP values (# samples x # features).

The result of shap_values = explainer.shap_values(X_test) is a matrix of shape (n_samples, 5) (one column per feature in the sample data).

When you take a single row, e.g. shap_values[0], you get a vector explaining the feature contributions of the first prediction; that is why the error Summary plots need a matrix of shap_values, not a vector. is raised.
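You can confirm this by printing the shapes yourself (continuing from the XGBoost script in the question):

print(np.shape(shap_values))     # (n_samples, 5): one row per sample, one column per feature
print(np.shape(shap_values[1]))  # (5,): a single row, i.e. a vector, which summary_plot rejects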

If you want to visualize an individual prediction such as shap_values[0], you can use a force_plot instead:

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0])
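Passing the corresponding feature row as well labels the force plot with the actual feature values (assuming X_test is the same DataFrame used above):

shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])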


EDIT

The difference between the outputs of the two models comes from how the result (out) is calculated. Looking at the LightGBM branch of the source code: once the variable phi has been computed, the values are concatenated as follows

phi = np.concatenate((0-phi, phi), axis=-1)

generating an array of shape (n_samples, 2 * (n_features + 1)).

This shape does not match X_test, i.e. phi.shape[1] != X.shape[1] + 1, so it is reshaped into a three-dimensional array

phi = phi.reshape(X.shape[0], phi.shape[1]//(X.shape[1]+1), X.shape[1]+1)

Finally, the output is a list of length two:

out = [phi[:, i, :-1] for i in range(phi.shape[1])]
out
>>>
[array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        ...
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]),
 array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        ...  
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])]
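As a toy illustration of those three steps (the same array operations applied to made-up numbers, not the actual SHAP code):

import numpy as np

n_samples, n_features = 3, 5
# pred_contrib=True returns one column per feature plus a final bias column
phi = np.random.rand(n_samples, n_features + 1)

# step 1: binary objective -> stack "class 0" and "class 1" contributions
phi = np.concatenate((0 - phi, phi), axis=-1)
print(phi.shape)  # (3, 12) == (n_samples, 2 * (n_features + 1))

# step 2: reshape to (n_samples, n_classes, n_features + 1)
phi = phi.reshape(n_samples, phi.shape[1] // (n_features + 1), n_features + 1)
print(phi.shape)  # (3, 2, 6)

# step 3: drop the bias column and split into one matrix per class
out = [phi[:, i, :-1] for i in range(phi.shape[1])]
print(len(out), out[0].shape)  # 2 (3, 5)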

See the examples below for how the calculation of out differs between the two libraries.

Example with LightGBM

import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
import xgboost as xgb 
import shap.explainers as explainers
from sklearn.model_selection import train_test_split

df = pd.read_csv("test_data.csv")
target=df.pop('target')

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.5, random_state=0)

model = lgb.LGBMClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model_fitted)

# Calculate phi from https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L347
tree_limit = -1 if explainer.model.tree_limit is None else explainer.model.tree_limit
phi = explainer.model.original_model.predict(X_test, num_iteration=tree_limit, pred_contrib=True)

# Objective is binary: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L349
if explainer.model.original_model.params['objective'] == 'binary':
    phi = np.concatenate((0-phi, phi), axis=-1)

# Phi shape is different from X_test:
if phi.shape[1] != X_test.shape[1] + 1:
    phi = phi.reshape(X_test.shape[0], phi.shape[1]//(X_test.shape[1]+1), X_test.shape[1]+1)

# Return out: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L370
expected_value = [phi[0, i, -1] for i in range(phi.shape[1])]
out = [phi[:, i, :-1] for i in range(phi.shape[1])]
expected_value
>>> [-0.8109302162163288, 0.8109302162163288]
out
>>> 
[array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]),
 array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])]
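In practice you do not have to compute phi yourself: for this LightGBM model, explainer.shap_values(X_test) returns that same two-element list, and indexing it with [1] gives the matrix that summary_plot expects (which is exactly what the LightGBM code in the question does):

shap_values = explainer.shap_values(X_test)  # [class 0 matrix, class 1 matrix]
shap.summary_plot(shap_values[1], X_test)    # contributions towards class 1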

Example with XGBoost

import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
import xgboost as xgb 
import shap.explainers as explainers
from sklearn.model_selection import train_test_split

df = pd.read_csv("test_data.csv")
target=df.pop('target')

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.5, random_state=0)

model = xgb.XGBClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model_fitted)

# Transform data to DMatrix: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L326
if not isinstance(X_test, xgb.core.DMatrix):
    X_test = xgb.DMatrix(X_test)

tree_limit = explainer.model.tree_limit

# Calculate phi: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L331
phi = explainer.model.original_model.predict(
    X_test, ntree_limit=tree_limit, pred_contribs=True,
    approx_contribs=False, validate_features=False
)

# Model output is "raw": https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L339
model_output_vals = explainer.model.original_model.predict(
    X_test, ntree_limit=tree_limit, output_margin=True,
    validate_features=False
)
model_output_vals
>>> array([-0.11323176, -0.11323176,  0.5436669 ,  0.87637275,  1.5332711 ,
       -0.11323176,  1.5332711 ,  0.5436669 ,  1.5332711 ,  0.5436669 ,
        0.87637275,  0.87637275, -0.11323176,  0.5436669 ], dtype=float32)

# Return out: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L374
expected_value_ = phi[0, -1]
expected_value_
>>> 0.817982
out_ = phi[:, :-1]
out_
>>>
array([[ 0.        , -0.35038763, -0.5808259 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763, -0.5808259 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 , -0.5808259 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 ,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763, -0.5808259 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 ,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 , -0.5808259 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 ,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 , -0.5808259 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763, -0.5808259 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 , -0.5808259 ,  0.        ,  0.        ]],
      dtype=float32)
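Because phi holds per-feature contributions to the raw margin (log-odds), you can check that each row of out_ plus expected_value_ reproduces the model_output_vals computed above:

# each row of contributions plus the expected value should reproduce the raw margin
# expected output: True (up to float32 precision)
print(np.allclose(out_.sum(axis=1) + expected_value_, model_output_vals, atol=1e-5))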

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
