Why is the output of sklearn.feature_selection.chi2 NaN? Can a feature with no variation not be compared to a feature with variation?
I want to build a heat map that shows how the presence of a feature in each column relates to its presence in every other column.
I have this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import chi2
df = pd.DataFrame([[0,0,0],[0,1,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,1,0],[0,0,0],
[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,1,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],
[0,0,0],[0,0,0],[0,1,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,1,0],[0,0,0],[0,0,0],[0,0,0],[0,1,0],[0,1,0]],columns=['feature1','feature2','feature3'])
# Resultant Dataframe will be a dataframe where the column names and Index will be the same
# This is a matrix similar to correlation matrix which we get after df.corr()
# Initialize the values in this matrix with 0
resultant = pd.DataFrame(0.0, index=df.columns, columns=df.columns)
# Finding p_value for all columns and putting them in the resultant matrix
for i in df.columns:
    for j in df.columns:
        if i != j:
            # chi2 returns one-element arrays here, so take the scalar p-value
            chi2_val, p_val = chi2(df[[i]], df[j])
            resultant.loc[i, j] = p_val[0]
print(resultant)
fig = plt.figure(figsize=(6,6))
sns.heatmap(resultant, annot=True, cmap='Blues')
plt.title('Chi-Square Test Results')
plt.show()
It generates a heat map:
However the actual scores are like this:
          feature1  feature2  feature3
feature1  0.000000  0.867632       NaN
feature2  0.862684  0.000000       NaN
feature3       NaN       NaN       0.0
This is a realistic representation of my real data, where there are only a few missing data points in each column and I want to check their relation to all the other columns. Is it not feasible to do this? For example, in this case feature2 contains both 1s and 0s but feature3 is all 0s, so is it simply not possible to calculate the chi-squared statistic between feature2 and feature3?
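To make the suspicion above concrete, here is a minimal sketch of the textbook Pearson chi-squared statistic on a 2x2 contingency table, computed by hand with NumPy. It is not necessarily how sklearn's chi2 works internally, and the small data frame and the names `constant_cols`, `observed`, `expected` and `statistic` are made up for illustration. It suggests that a constant column gives expected counts of zero, so the statistic becomes 0/0:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: feature2 varies, feature3 is constant (all zeros).
df = pd.DataFrame({
    "feature2": [0, 1, 0, 0, 1, 0, 0, 1],
    "feature3": [0, 0, 0, 0, 0, 0, 0, 0],
})

# Columns with a single distinct value cannot show any association.
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]

# Observed 2x2 contingency table: rows are feature2 values, columns are feature3 values.
observed = np.zeros((2, 2))
for a, b in zip(df["feature2"], df["feature3"]):
    observed[a, b] += 1

# Expected counts under independence: (row total * column total) / n.
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Cells with an expected count of zero produce 0/0, which is NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    statistic = ((observed - expected) ** 2 / expected).sum()

print(constant_cols)
print(expected)
print(statistic)
```

If that reading is right, one option would be to drop the columns listed in `constant_cols` before building the heat map, or to leave their cells blank rather than treating NaN as a score.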
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow