How to reduce a categorical variable in a logistic regression model in R
I've created a logistic regression model of mpg for various makes and models of cars. One variable, "origin", was an integer with 1 = American, 2 = German, 3 = Japanese. I converted it with:
origin.factor <- factor(origin, labels = c("American", "German", "Japanese"))
> head(origin.factor)
[1] American American American American American
[6] American
Levels: American German Japanese
First, it was suggested that I convert "origin" to a factor using as.factor and relabel it, but I did not see how to pass labels = c("American", "German", "Japanese") to as.factor. Any ideas?
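For reference, as.factor() takes no labels argument; factor() does. The two-step route with as.factor() is to convert first and then overwrite the levels. A sketch using a small made-up sample of the integer codes:

```r
origin <- c(1, 1, 2, 3)  # hypothetical sample of the integer origin codes

# One step with factor(), as in the question:
origin.factor <- factor(origin, labels = c("American", "German", "Japanese"))

# Two steps with as.factor(), relabelling afterwards via levels<-:
origin.factor2 <- as.factor(origin)
levels(origin.factor2) <- c("American", "German", "Japanese")

identical(origin.factor, origin.factor2)  # TRUE
```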
Next, the initial logistic model with all variables yielded this output:
> auto.mpg.logistic <- glm(mpg.binary~cylinders + displacement + horsepower + weight + acceleration + year + origin.factor, family="binomial")
> summary(auto.mpg.logistic)
Call:
glm(formula = mpg.binary ~ cylinders + displacement + horsepower +
weight + acceleration + year + origin.factor, family = "binomial")
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.44937  -0.08809   0.00577   0.19315   3.03363
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)           -19.450793   5.956353  -3.266  0.00109 **
cylinders              -0.264169   0.439645  -0.601  0.54793
displacement            0.015568   0.013658   1.140  0.25434
horsepower             -0.043081   0.024621  -1.750  0.08017 .
weight                 -0.005762   0.001376  -4.187 2.83e-05 ***
acceleration            0.012939   0.142921   0.091  0.92786
year                    0.495635   0.086155   5.753 8.78e-09 ***
origin.factorGerman     1.971277   0.785573   2.509  0.01210 *
origin.factorJapanese   1.102741   0.713768   1.545  0.12236
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Next, I removed the variables whose p-values were above the 0.05 level of significance, which gave the following output:
> auto.mpg.logistic <- glm(mpg.binary~ horsepower + weight + year + origin.factor, family="binomial")
> summary(auto.mpg.logistic)
Call:
glm(formula = mpg.binary ~ horsepower + weight + year + origin.factor,
family = "binomial")
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2675  -0.0943   0.0080   0.2007   3.2653
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)           -18.240055   4.912407  -3.713 0.000205 ***
horsepower             -0.042209   0.016441  -2.567 0.010251 *
weight                 -0.004607   0.000734  -6.276 3.47e-10 ***
year                    0.457663   0.075997   6.022 1.72e-09 ***
origin.factorGerman     1.335225   0.529879   2.520 0.011740 *
origin.factorJapanese   0.628677   0.580123   1.084 0.278500
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now the only variable still above the 0.05 level of significance is origin.factorJapanese.
So the question is: can I somehow remove just origin.factorJapanese and leave in origin.factorGerman, since it is significant?
Or is the appropriate action to remove origin.factor which will eliminate all aspects of this categorical variable from my logistic model (this seems like my only option...)?
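For reference, there is a third mechanical option in base R: collapse levels by reassigning levels(), merging Japanese into the base category so only a German contrast remains. A sketch with a small made-up sample standing in for origin.factor (whether this is statistically sound is a separate question):

```r
origin.factor <- factor(c(1, 2, 3, 1),
                        labels = c("American", "German", "Japanese"))

# Assigning duplicated level names merges those levels:
origin.collapsed <- origin.factor
levels(origin.collapsed) <- c("Other", "German", "Other")
levels(origin.collapsed)  # "Other"  "German"

# Refitting the glm with origin.collapsed in place of origin.factor would
# then estimate a single German-vs-everything-else coefficient.
```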
I'm new to R and primarily use base R functions as per our class assignments so please consider that in your answers. Thanks,
John
Solution 1:[1]
This is really about statistics more than it is about R. You have a model which has a bunch of continuous explanatory variables (horsepower, weight, year) and a single factor, origin.factor. The model you are fitting is a parallel-lines model. That is, for each level of origin.factor you are fitting a hyper-plane (but just think about it as a line if it helps) with a different intercept for each country of origin.
R uses the Intercept to fit the base level of your factor, and the remaining factor coefficients are really the differences between each level and the base level. Therefore what the regression summary table is telling you is that German cars are different from American cars (American is the base level because it comes first alphabetically, which is how R orders factor levels by default), but Japanese cars are not. Note that it tells you nothing about the difference between German and Japanese cars.
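The base level can be changed with relevel() if a different contrast is of interest; a sketch, again assuming a small made-up sample of the origin codes:

```r
origin.factor <- factor(c(1, 1, 2, 3),
                        labels = c("American", "German", "Japanese"))
levels(origin.factor)     # "American" comes first, so it is the base level

# Make Japanese the reference category instead:
origin.releveled <- relevel(origin.factor, ref = "Japanese")
levels(origin.releveled)  # "Japanese" "American" "German"

# Refitting the glm with origin.releveled would then report the American and
# German coefficients relative to Japanese.
```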
So, you have some evidence that there are differences between the levels of the factor, but not between all of them. You really don't want to try to fit the model without the Japanese level in there (well, you might, but not for the reasons you think).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution   | Source       |
|------------|--------------|
| Solution 1 | James Curran |