SHAP: XGBoost and LightGBM difference in shap_values calculation
I have this code in Visual Studio Code:
import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, cross_val_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, accuracy_score
df = pd.read_csv("./mydataset.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=22)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=22)
xgb_model = xgb.XGBClassifier(eval_metric='mlogloss', use_label_encoder=False)
xgb_fitted = xgb_model.fit(X_train, y_train)
explainer = shap.TreeExplainer(xgb_fitted)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values[1], X_test)
shap.summary_plot(shap_values[1], X_test, plot_type="bar")
When I run this code, I get this error on the shap.summary_plot line:
Summary plots need a matrix of shap_values, not a vector.
What is the problem and how can I solve it?
The above code is based on this code sample: https://github.com/slundberg/shap.
The dataset looks as follows:
Cat1,Cat2,Age,Cat3,Cat4,target
0,0,18,1,0,1
0,0,17,1,0,1
0,0,15,1,1,1
0,0,15,1,0,1
0,0,16,1,0,1
0,1,16,1,1,1
0,1,16,1,1,1
0,0,17,1,0,1
0,1,15,1,1,1
0,1,15,1,0,1
0,0,15,1,0,1
0,0,15,1,0,1
0,1,15,1,1,1
0,1,15,1,0,1
0,1,15,1,0,1
0,0,16,1,0,1
0,0,16,1,0,1
0,0,16,1,0,1
0,1,17,1,0,0
0,1,16,1,1,1
0,1,15,1,0,1
0,1,15,1,0,1
0,1,16,1,1,1
0,1,16,1,1,1
0,0,15,0,0,1
0,0,16,1,0,1
0,1,15,1,0,1
Please note that the actual data has 700 rows; I copied a small portion of it just to show what the data looks like.
Edit 1
The main reason for this question is to understand how the code should change when using different classifiers.
I originally had sample code with LightGBM which worked, but when I changed it to XGBoost, it generated an error on the summary plot.
To show what I mean, I developed the following sample code:
import pandas as pd
import shap
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import train_test_split
df = pd.read_csv("./mydataset.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=22)
# select one of the two models
model = xgb.XGBClassifier()
#model = lgb.LGBMClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.Explainer(model_fitted)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values[1], X_test)
shap.summary_plot(shap_values[1], X_test, plot_type="bar")
If I use the LGBM model, it works well, but if I use XGBoost, it fails. What is the difference, and how should I change the code so that XGBoost behaves like LGBM and the application works?
Solution 1
Assuming you have copied the data from the question above, the following will do:
import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    cross_validate,
    cross_val_score,
)
from sklearn.metrics import (
    classification_report,
    ConfusionMatrixDisplay,
    accuracy_score,
)
df = pd.read_clipboard(sep=",")
target = df.pop("target")
X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
xgb_model = xgb.XGBClassifier(eval_metric="mlogloss", use_label_encoder=False)
xgb_fitted = xgb_model.fit(X_train, y_train)
explainer = shap.TreeExplainer(xgb_fitted)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
# shap.summary_plot(shap_values, X_test, plot_type="bar")
The code you pasted assumes two ["identical"] arrays of SHAP values, one per class ("0" and "1"). Something has changed in the way explainer.shap_values calculates SHAP values for XGBoost since that sample was published, so supplying shap_values directly (without a class index) is now enough.
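If you want the same plotting code to work regardless of whether the explainer returns a list of per-class arrays (as it currently does for LGBMClassifier) or a single matrix (as it does for XGBClassifier with recent shap versions), a minimal sketch like the following, assuming a binary target, normalizes the output first:
# shap_values may be a list of per-class arrays (LightGBM) or a single
# (n_samples, n_features) array (XGBoost); pick the positive-class matrix.
if isinstance(shap_values, list):
    sv_matrix = shap_values[1]   # contributions towards class "1"
else:
    sv_matrix = shap_values      # already a single matrix
shap.summary_plot(sv_matrix, X_test)
shap.summary_plot(sv_matrix, X_test, plot_type="bar")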
Solution 2
Notice that with summary_plot() you want to visualize which features are, in general, most important to the model, so it requires a matrix:
For single output explanations this is a matrix of SHAP values (# samples x # features).
The result of shap_values = explainer.shap_values(X_test) is a matrix of shape (n_samples, 5) (5 being the number of feature columns in the sample data). When you take the first sample, shap_values[0] is a vector that explains the feature contributions of the first prediction only, which is why
Summary plots need a matrix of shap_values, not a vector.
is raised.
If you want to visualize an individual prediction such as shap_values[0], you could use a force_plot:
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0])
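As a quick sanity check of the shapes involved (a minimal sketch reusing the variables above), you can print them before plotting; passing the matching feature row also makes the force plot easier to read:
print(shap_values.shape)     # (n_samples, 5): a matrix, fine for summary_plot
print(shap_values[0].shape)  # (5,): a single row, i.e. the vector that triggers the error
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])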
EDIT
The difference between the outputs of the two models is due to how the out result is calculated. Checking the source code for the LightGBM case: once the variable phi is calculated, the values are concatenated in the following way:
phi = np.concatenate((0-phi, phi), axis=-1)
generating an array of shape (n_samples, n_features*2). This shape differs from X_test, that is, phi.shape[1] != X.shape[1] + 1, so it is reshaped to a three-dimensional array:
phi = phi.reshape(X.shape[0], phi.shape[1]//(X.shape[1]+1), X.shape[1]+1)
Finally, the output is a list of length two:
out = [phi[:, i, :-1] for i in range(phi.shape[1])]
out
>>>
[array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
...
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
...
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])]
See the examples below to see how the out calculation differs between the two libraries.
Example with LightGBM
import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
import xgboost as xgb
import shap.explainers as explainers
from sklearn.model_selection import train_test_split
df = pd.read_csv("test_data.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.5, random_state=0)
model = lgb.LGBMClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model_fitted)
# Calculate phi from https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L347
tree_limit = -1 if explainer.model.tree_limit is None else explainer.model.tree_limit
phi = explainer.model.original_model.predict(X_test, num_iteration=tree_limit, pred_contrib=True)
# Objective is binary: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L349
if explainer.model.original_model.params['objective'] == 'binary':
    phi = np.concatenate((0-phi, phi), axis=-1)
# Phi shape is different from X_test:
if phi.shape[1] != X_test.shape[1] + 1:
    phi = phi.reshape(X_test.shape[0], phi.shape[1]//(X_test.shape[1]+1), X_test.shape[1]+1)
# Return out: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L370
expected_value = [phi[0, i, -1] for i in range(phi.shape[1])]
out = [phi[:, i, :-1] for i in range(phi.shape[1])]
expected_value
>>> [-0.8109302162163288, 0.8109302162163288]
out
>>>
[array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])]
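Because of the phi = np.concatenate((0-phi, phi), axis=-1) step, the two arrays in out are mirror images of each other, and the same holds for the expected values; a minimal check using the variables from the example above:
np.allclose(out[0], -out[1])                       # True: class-0 contributions are the negated class-1 ones
np.isclose(expected_value[0], -expected_value[1])  # True: base values are mirrored as well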
Example with XGBoost
import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
import xgboost as xgb
import shap.explainers as explainers
from sklearn.model_selection import train_test_split
df = pd.read_csv("test_data.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.5, random_state=0)
model = xgb.XGBClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model_fitted)
# Transform data to DMatrix: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L326
if not isinstance(X_test, xgb.core.DMatrix):
    X_test = xgb.DMatrix(X_test)
tree_limit = explainer.model.tree_limit
# Calculate phi: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L331
phi = explainer.model.original_model.predict(
    X_test, ntree_limit=tree_limit, pred_contribs=True,
    approx_contribs=False, validate_features=False
)
# Model output is "raw": https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L339
model_output_vals = explainer.model.original_model.predict(
    X_test, ntree_limit=tree_limit, output_margin=True,
    validate_features=False
)
model_output_vals
>>> array([-0.11323176, -0.11323176, 0.5436669 , 0.87637275, 1.5332711 ,
-0.11323176, 1.5332711 , 0.5436669 , 1.5332711 , 0.5436669 ,
0.87637275, 0.87637275, -0.11323176, 0.5436669 ], dtype=float32)
# Return out: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L374
expected_value_ = phi[0, -1]
expected_value_
>>> 0.817982
out_ = phi[:, :-1]
out_
>>>
array([[ 0. , -0.35038763, -0.5808259 , 0. , 0. ],
[ 0. , -0.35038763, -0.5808259 , 0. , 0. ],
[ 0. , 0.3065111 , -0.5808259 , 0. , 0. ],
[ 0. , -0.35038763, 0.4087782 , 0. , 0. ],
[ 0. , 0.3065111 , 0.4087782 , 0. , 0. ],
[ 0. , -0.35038763, -0.5808259 , 0. , 0. ],
[ 0. , 0.3065111 , 0.4087782 , 0. , 0. ],
[ 0. , 0.3065111 , -0.5808259 , 0. , 0. ],
[ 0. , 0.3065111 , 0.4087782 , 0. , 0. ],
[ 0. , 0.3065111 , -0.5808259 , 0. , 0. ],
[ 0. , -0.35038763, 0.4087782 , 0. , 0. ],
[ 0. , -0.35038763, 0.4087782 , 0. , 0. ],
[ 0. , -0.35038763, -0.5808259 , 0. , 0. ],
[ 0. , 0.3065111 , -0.5808259 , 0. , 0. ]],
dtype=float32)
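As a consistency check (a sketch based on the values printed above): the XGBoost SHAP values are in raw log-odds space, so the base value plus the per-row contributions should reproduce the raw margin predictions obtained with output_margin=True:
np.allclose(expected_value_ + out_.sum(axis=1), model_output_vals, atol=1e-5)  # True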
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow