Get feature importance from SHAP values
I would like to get a dataframe of important features. With the code below I got the shap_values, but I am not sure what the values mean. My df has 142 features and 67 experiments, but I got an array with ca. 2500 values.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
I have tried to store them in a df:
rf_resultX = pd.DataFrame(shap_values, columns = ['shap_values'])
but got: ValueError: Shape of passed values is (18, 142), indices imply (18, 1)
142 is the number of features. I have no idea where the 18 comes from.
I believe it works as follows:
- shap_values need to be averaged.
- and paired with the feature names: pd.DataFrame(feature_names, columns = ['feature_names'])
Does anybody have experience interpreting shap_values? At first I thought that the number of values would be the number of features x the number of rows.
Solution 1:[1]
Combining the other two answers like this worked for me.
import numpy as np
import pandas as pd

# shap_values is the 2D array computed in the question
feature_names = X_train.columns
rf_resultX = pd.DataFrame(shap_values, columns=feature_names)
# Mean absolute SHAP value per feature (axis 0 runs over the samples)
vals = np.abs(rf_resultX.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)),
                               columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'],
                            ascending=False, inplace=True)
shap_importance.head()
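The np.abs call in the snippet above matters because SHAP values are signed, so a plain mean can cancel out a feature that pushes predictions strongly in both directions. A minimal sketch with a made-up toy array (the numbers and the two features are hypothetical, not real SHAP output):

```python
import numpy as np

# Toy stand-in for SHAP values: 4 samples x 2 hypothetical features.
# Column 0 pushes predictions up and down equally; column 1 has a small steady effect.
toy_shap = np.array([
    [1.0, 0.1],
    [-1.0, 0.1],
    [1.0, 0.1],
    [-1.0, 0.1],
])

signed_mean = toy_shap.mean(axis=0)       # positive and negative contributions cancel
abs_mean = np.abs(toy_shap).mean(axis=0)  # average magnitude, regardless of sign

print(signed_mean)
print(abs_mean)
```

With the signed mean, column 0 would look unimportant even though it moves every prediction; the mean of absolute values ranks it first.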
Solution 2:[2]
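shap_values has shape (num_rows, num_features); if you want to convert it to a dataframe, you should pass the list of feature names to the columns parameter: rf_resultX = pd.DataFrame(shap_values, columns = feature_names). In the question, the 18 is presumably the number of rows in X_test, and 18 x 142 = 2556 accounts for the roughly 2500 values.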
Each sample has its own shap value for each feature; the shap value tells you how much that feature has contributed to the prediction for that particular sample; this is called a local explanation. You could average shap values for each feature to get a feeling of global feature importance, but I'd suggest you take a look at the documentation since the shap package itself provides much more powerful visualizations/interpretations.
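The shape mismatch from the question can be reproduced without a model. The sketch below uses a random array as a stand-in for real SHAP output, with the sizes taken from the error message (18 rows, 142 features); the feature names f0, f1, ... are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

n_rows, n_features = 18, 142  # sizes taken from the error message in the question
rng = np.random.default_rng(0)
fake_shap = rng.normal(size=(n_rows, n_features))     # stand-in, not real SHAP output
feature_names = [f"f{i}" for i in range(n_features)]  # hypothetical feature names

# Passing a single column name for 142 columns of data raises the ValueError
try:
    pd.DataFrame(fake_shap, columns=["shap_values"])
    raised = False
except ValueError:
    raised = True

# One column per feature works: each row is a local explanation for one sample
shap_df = pd.DataFrame(fake_shap, columns=feature_names)

# Averaging absolute values over the rows gives one global score per feature
global_importance = shap_df.abs().mean().sort_values(ascending=False)
print(global_importance.head())
```

Each row of shap_df is the local explanation for one sample, and global_importance collapses those rows into a single ranking with one entry per feature.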
Solution 3:[3]
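For version 0.40.0 (the latest at the time of writing), where shap_values is the Explanation object returned by calling the explainer, e.g. shap_values = explainer(X_test):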
feature_names = shap_values.feature_names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
vals = np.abs(shap_df.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
Solution 4:[4]
From https://github.com/slundberg/shap/issues/632
vals = np.abs(shap_values.values).mean(0)
feature_names = train_x.columns  # columns is an attribute, not a method
feature_importance = pd.DataFrame(list(zip(feature_names, vals)),
                                  columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()
Solution 5:[5]
I wrote a short function for this which also works for multi-class classifications. It expects the data as a pandas DataFrame, a list of shap value arrays with one array for each class, and optionally a list of columns for which you want the average shap values.
import numpy as np
import pandas as pd
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

def shap_feature_ranking(data, shap_values, columns=None):
    if not columns:
        columns = data.columns.tolist()  # If columns are not given, take all columns
    # Get column locations for desired columns in given dataframe
    c_idxs = [data.columns.get_loc(column) for column in columns]
    if isinstance(shap_values, list):  # If shap values is a list of arrays (i.e., several classes)
        # Compute mean absolute shap values per class
        means = [np.abs(class_values[:, c_idxs]).mean(axis=0) for class_values in shap_values]
        shap_means = np.sum(np.column_stack(means), 1)  # Sum of shap values over all classes
    else:  # Else there is only one 2D array of shap values
        assert len(shap_values.shape) == 2, 'Expected two-dimensional shap values array.'
        # Restrict to the requested columns so the lengths match below
        shap_means = np.abs(shap_values[:, c_idxs]).mean(axis=0)
    # Put into dataframe along with columns and sort by shap_means, reset index to get ranking
    df_ranking = pd.DataFrame({'feature': columns, 'mean_shap_value': shap_means})
    df_ranking = df_ranking.sort_values(by='mean_shap_value', ascending=False).reset_index(drop=True)
    df_ranking.index += 1
    return df_ranking
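The function above handles a list of per-class arrays and a single 2D array. Depending on the shap version, multi-class output may instead arrive as one 3D array; I believe newer releases put the classes on the last axis, but treat that layout as an assumption and check shap_values.shape before relying on it. A minimal sketch that normalizes both forms, using random stand-in data and hypothetical helper names (per_class_arrays, summed_importance):

```python
import numpy as np

def per_class_arrays(shap_values):
    """Return a list of 2D (n_samples, n_features) arrays, one per class."""
    if isinstance(shap_values, list):
        return [np.asarray(a) for a in shap_values]
    arr = np.asarray(shap_values)
    if arr.ndim == 3:
        # ASSUMPTION: layout is (n_samples, n_features, n_classes)
        return [arr[:, :, k] for k in range(arr.shape[2])]
    return [arr]  # single-output case: one 2D array

def summed_importance(shap_values):
    """Mean absolute value per feature, summed over classes."""
    return sum(np.abs(a).mean(axis=0) for a in per_class_arrays(shap_values))

# Stand-in data: 2 classes, 5 samples, 3 features (not real SHAP output)
rng = np.random.default_rng(0)
as_list = [rng.normal(size=(5, 3)) for _ in range(2)]
as_3d = np.stack(as_list, axis=-1)  # same values, classes on the last axis

imp_list = summed_importance(as_list)
imp_3d = summed_importance(as_3d)
print(imp_list, imp_3d)
```

Both calls should give the same per-feature scores, since the 3D array here is just the list stacked along a new last axis.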
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | sengul_karaderili |
Solution 2 | Manu Valdés |
Solution 3 | |
Solution 4 | |
Solution 5 | |