How to get the P Value in a Variable from OLSResults in Python?
The OLSResults of
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
print(fit.summary())
shows the P values of each attribute to only 3 decimal places.
I need to extract the p-value for each attribute, such as Distance, CarrierNum, etc., and print it in scientific notation.
I can extract the coefficients using fit.params[0], fit.params[1], etc.
I need to do the same for all of their p-values.
Also, what does it mean when all the p-values are 0?
Solution 1:[1]
Use fit.pvalues[i], where i is the index of the independent variable: fit.pvalues[0] for the intercept, fit.pvalues[1] for Distance, and so on.
Because X is a DataFrame, fit.pvalues is a pandas Series, so you can also access a value by column name, e.g. fit.pvalues['Distance'].
You can also list all the attributes of an object using dir(<object>).
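The question also asks for scientific notation, which plain Python string formatting can handle. The following is a minimal sketch, not part of the original answer: the helper name format_pvalues and the numbers in example are hypothetical stand-ins, and it assumes that something like fit.pvalues.to_dict() gives you a name-to-p-value mapping.

```python
def format_pvalues(pvalues, digits=3):
    """Return {name: p-value rendered in scientific notation}."""
    return {name: f"{p:.{digits}e}" for name, p in pvalues.items()}


# Hypothetical p-values standing in for fit.pvalues.to_dict()
example = {"const": 0.0421, "Distance": 3.2e-17, "CarrierNum": 0.00051}

for name, text in format_pvalues(example).items():
    # Left-align the name in a 12-character column, then print the value
    print(f"{name:<12}{text}")
```

The digits argument controls how many digits appear after the decimal point of the mantissa.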
Solution 2:[2]
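Instead of using fit.summary(), you could loop over fit.pvalues by position to print the p-values of all your features/attributes as follows:
import pandas as pd
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
# fit.pvalues holds one entry per term: the constant first, then each feature
for attributeIndex in range(len(fit.pvalues)):
    # .iloc makes the positional lookup explicit
    print(fit.pvalues.iloc[attributeIndex])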
==========================================================================
Also what does all P values being 0 mean?
It might be a good outcome. Note that the summary rounds to 3 decimal places, so a displayed 0.000 means the p-value is below 0.0005, not exactly zero. The p-value for each term tests the null hypothesis that its coefficient (b1, b2, ..., bn) is equal to zero, i.e. that the term has no effect in the fitted equation y = b0 + b1x1 + b2x2 + ... A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable (y).
On the other hand, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response.
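To make the point about zeros concrete, here is a small sketch showing how a very small p-value looks under 3-decimal rounding compared with scientific notation, alongside the 0.05 threshold mentioned above. The function name describe_pvalue and the sample values are hypothetical, chosen only for illustration.

```python
def describe_pvalue(p, alpha=0.05):
    """Show one p-value as rounded text, scientific text, and a significance flag."""
    return {
        "rounded": f"{p:.3f}",      # three decimals, as in the summary table
        "scientific": f"{p:.2e}",   # keeps the order of magnitude visible
        "significant": p < alpha,   # reject the null hypothesis at level alpha?
    }


# A tiny p-value rounds to 0.000, while a large one stays readable either way
print(describe_pvalue(3.2e-17))
print(describe_pvalue(0.2))
```

A value displayed as 0.000 can therefore still be a small positive number, which is why extracting it from fit.pvalues is more informative than reading the table.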
Solution 3:[3]
I have used this solution
df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
model = sm.OLS(Y, X).fit()
# The following snippet builds a DataFrame of feature names and their p-values.
# Sorting in ascending order puts the most significant features (smallest p-values) at the top.
d = {}
for i in X.columns.tolist():
    d[i] = model.pvalues[i]
df_pvalue = pd.DataFrame(d.items(), columns=['Var_name', 'p-Value']).sort_values(by='p-Value').reset_index(drop=True)
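If you only need the ranking and not a DataFrame, the same ascending sort can be sketched with the built-in sorted(). This is an illustrative alternative rather than the answer's method: rank_by_pvalue and the values in example are hypothetical, and the mapping is assumed to come from something like model.pvalues.to_dict().

```python
def rank_by_pvalue(pvalues):
    """Return (name, p-value) pairs ordered from smallest to largest p-value."""
    return sorted(pvalues.items(), key=lambda item: item[1])


# Hypothetical p-values standing in for model.pvalues.to_dict()
example = {"const": 0.0421, "Distance": 3.2e-17, "CarrierNum": 0.00051, "Day": 0.37}

for name, p in rank_by_pvalue(example):
    # Print each term with its p-value in scientific notation
    print(f"{name:<12}{p:.3e}")
```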
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | xanjay |
Solution 2 | Marcos Pacheco Jr |
Solution 3 | Suhas_Pote |