How to do point biserial correlation for multiple columns in one iteration
I am trying to calculate a point-biserial correlation for a set of columns in my dataset. It works for an individual variable, but when I try to calculate it for all the columns in one iteration I get an error.
Below is the code:
import pandas as pd
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
corr_list = {}
y = df['A'].astype(float)
for column in df:
    x = df[['B', 'C', 'D']].astype(float)
    corr = stats.pointbiserialr(x, y)
    corr_list[['B', 'C', 'D']] = corr
print(corr_list)
TypeError: No loop matching the specified signature and casting was found for ufunc add
Solution 1:[1]
stats.pointbiserialr expects two one-dimensional arrays, so x must be a single column, not a DataFrame. If you pass one column at a time inside the loop, it works. You can try this:
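import pandas as pd
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
print(df)
corr_list = []
y = df['A'].astype(float)
# Skip 'A' itself; correlating it with itself would just give 1.0
for column in df.columns.drop('A'):
    x = df[column]
    # Pass one column at a time, as plain 1-D sequences
    corr = stats.pointbiserialr(list(x), list(y))
    corr_list.append(corr[0])  # corr[0] is the coefficient, corr[1] the p-value
print(corr_list)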
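By the way, you can also use print(df.corr()), which gives the Pearson correlation matrix of the DataFrame. The point-biserial coefficient is the Pearson coefficient computed with one binary variable, so the 'A' row of that matrix holds the same coefficients (without p-values).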
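To check the df.corr() shortcut, here is a minimal sketch that compares the 'A' column of the Pearson matrix with the coefficients returned by stats.pointbiserialr. The variable names pearson_with_a and point_biserial are only illustrative and do not come from the original code.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'A': [1, 0, 1, 0, 1],
                   'B': [6, 7, 8, 9, 10],
                   'C': [9, 4, 6, 9, 10],
                   'D': [8, 9, 5, 7, 10]})

# Pearson correlation of every other column with the binary column 'A'
pearson_with_a = df.corr()['A'].drop('A')

# Point-biserial coefficient computed one column at a time
target = df['A'].astype(float)
point_biserial = pd.Series(
    {col: stats.pointbiserialr(target, df[col].astype(float))[0]
     for col in ['B', 'C', 'D']}
)

# The two sets of coefficients should agree up to floating-point error
print(np.allclose(pearson_with_a.values, point_biserial.values))
```

df.corr() does not report p-values, so the explicit loop is still needed if you want significance levels.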
Solution 2:[2]
You can use the pd.DataFrame.corrwith() function:
df[['B', 'C', 'D']].corrwith(df['A'].astype('float'), method=stats.pointbiserialr)
The output has one column per input column: row 0 holds the correlation with the target Series and row 1 the corresponding p-value:
              B         C         D
0  4.547937e-18  0.400066 -0.094916
1  1.000000e+00  0.504554  0.879331
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Youssef_boughanmi |
Solution 2 |