Using categorical variables in statsmodels OLS class
I want to use the statsmodels OLS class to create a multiple regression model. Consider the following dataset:
import statsmodels.api as sm
import pandas as pd
import numpy as np
dict = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
        'debt_ratio': np.random.randn(5), 'cash_flow': np.random.randn(5) + 90}
df = pd.DataFrame.from_dict(dict)
x = df[['debt_ratio', 'industry']]
y = df['cash_flow']
def reg_sm(x, y):
    x = np.array(x).T
    x = sm.add_constant(x)
    results = sm.OLS(endog = y, exog = x).fit()
    return results
When I run the following code:
reg_sm(x, y)
I get the following error:
TypeError: '>=' not supported between instances of 'float' and 'str'
I've tried converting the industry variable to categorical, but I still get an error. I'm out of options.
Solution 1:[1]
You're on the right path with converting to a Categorical dtype. However, once you convert the DataFrame to a NumPy array, you get an object dtype (a NumPy array has a single dtype for the whole array). This means the individual values are still str underneath, which a regression cannot work with.
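To make the dtype point concrete, here is a minimal sketch. The three-row frame is a made-up stand-in for the question's data, not the original dataset; it only shows what happens to a Categorical column when it passes through np.array().

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data, with industry already Categorical.
df = pd.DataFrame({
    "industry": pd.Categorical(["mining", "finance", "mining"]),
    "debt_ratio": [0.1, 0.2, 0.3],
})

# Mixing a float column with a categorical one forces a single common dtype.
arr = np.array(df[["debt_ratio", "industry"]])

print(arr.dtype)        # the whole array falls back to object
print(type(arr[0, 1]))  # the category labels come out as plain Python str
```

So the Categorical conversion is lost at the np.array() step, which is why the original function still fails.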
What you might want to do is to dummify this feature. Instead of factorizing it, which would effectively treat the variable as continuous, you want to maintain some semblance of categorization:
>>> import statsmodels.api as sm
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)
>>> data = {
...     'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
...     'debt_ratio': np.random.randn(5),
...     'cash_flow': np.random.randn(5) + 90
... }
>>> data = pd.DataFrame.from_dict(data)
>>> # dtype=int keeps the dummies as 0/1; recent pandas versions default to bool
>>> data = pd.concat((
...     data,
...     pd.get_dummies(data['industry'], drop_first=True, dtype=int)), axis=1)
>>> # You could also use data.drop('industry', axis=1)
>>> # in the call to pd.concat()
>>> data
         industry  debt_ratio  cash_flow  finance  hospitality  mining  transportation
0          mining    0.357440  88.856850        0            0       1               0
1  transportation    0.377538  89.457560        0            0       0               1
2     hospitality    1.382338  89.451292        0            1       0               0
3         finance    1.175549  90.208520        1            0       0               0
4   entertainment   -0.939276  90.212690        0            0       0               0
Now you have dtypes that statsmodels can better work with. The purpose of drop_first is to avoid the dummy trap:
>>> y = data['cash_flow']
>>> x = data.drop(['cash_flow', 'industry'], axis=1)
>>> sm.OLS(y, x).fit()
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x115b87cf8>
Lastly, just a small pointer: it helps to avoid names that shadow built-in object types, such as dict.
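As a quick illustration of why shadowing bites, here is a sketch with made-up data. Once the name dict is rebound to a dictionary instance, calling it as the constructor fails until the name is restored.

```python
import builtins

# Rebinding the name: `dict` now refers to this particular dictionary.
dict = {"industry": ["mining", "finance"]}

try:
    dict(a=1)  # attempts to call a dictionary instance, not the built-in type
except TypeError as exc:
    message = str(exc)
    print(message)

# Point the name back at the built-in type so later code behaves normally.
dict = builtins.dict
restored = dict(a=1)
```

Using a name such as data for the dictionary avoids the problem entirely.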
Solution 2:[2]
I had this problem as well, with lots of columns that needed to be treated as categorical, which makes dummifying them by hand quite annoying. Converting them to string didn't work for me either.
For anyone looking for a solution without one-hot encoding the data by hand, the R-style formula interface provides a nice way of doing this:
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
data = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
        'debt_ratio': np.random.randn(5), 'cash_flow': np.random.randn(5) + 90}
df = pd.DataFrame.from_dict(data)
# NB: unlike sm.OLS, the formula interface includes an intercept term automatically
smf.ols(formula="cash_flow ~ debt_ratio + C(industry)", data=df).fit()
Reference: https://www.statsmodels.org/stable/example_formulas.html#categorical-variables
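To see what the formula interface builds, it can help to inspect the fitted parameters. The sketch below uses a larger synthetic dataset (four industries, five rows each, generated with an arbitrary seed) rather than the five-row frame above, so that there are more observations than coefficients. The data-generating line is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: 4 industries x 5 rows each.
rng = np.random.default_rng(0)
industries = ["mining", "transportation", "hospitality", "finance"]
df = pd.DataFrame({
    "industry": np.repeat(industries, 5),
    "debt_ratio": rng.normal(size=20),
})
df["cash_flow"] = 90 + 2 * df["debt_ratio"] + rng.normal(size=20)

# C(...) marks the column as categorical; the dummies are built internally.
results = smf.ols("cash_flow ~ debt_ratio + C(industry)", data=df).fit()

# One intercept, one slope for debt_ratio, and one coefficient per
# non-baseline industry level.
print(results.params)
```

One level of industry is absorbed into the intercept as the baseline, which is the formula interface's equivalent of drop_first=True.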
Solution 3:[3]
Here is another example for a similar case with categorical variables; its results match those from a statistics course taught in R (Hanken, Finland).
import wooldridge as woo
import statsmodels.formula.api as smf
import numpy as np
df = woo.dataWoo('beauty')
print(df.describe())
df['abvavg'] = (df['looks']>=4).astype(int) # good looking
df['belavg'] = (df['looks']<=2).astype(int) # bad looking
df_female = df[df['female']==1]
df_male = df[df['female']==0]
results_female = smf.ols(formula = 'np.log(wage) ~ belavg + abvavg',data=df_female).fit()
print(f"FEMALE results, summary \n {results_female.summary()}")
results_male = smf.ols(formula = 'np.log(wage) ~ belavg + abvavg',data=df_male).fit()
print(f"MALE results, summary \n {results_male.summary()}")
Regards, Markus
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | Markus Kaukonen |