How to reduce a categorical variable in a logistic regression model in R

I've created a logistic regression formula regarding mpg for various makes and models of cars. One variable, "origin", was an integer coded 1 = American, 2 = German, 3 = Japanese. I converted it to a factor:

origin.factor <- factor(origin, labels = c("American", "German", "Japanese"))
> head(origin.factor)
[1] American American American American American
[6] American
Levels: American German Japanese

First, it was suggested that I convert "origin" to a factor using as.factor and relabel it, but I did not see how to pass labels = c("American", "German", "Japanese") with as.factor. Any ideas?

Next, the initial logistic model with all variables yielded this output (apologies if the columns don't align in this post):

> auto.mpg.logistic <- glm(mpg.binary~cylinders + displacement + horsepower + weight + acceleration + year + origin.factor, family="binomial")
> summary(auto.mpg.logistic)

Call:
glm(formula = mpg.binary ~ cylinders + displacement + horsepower + 
    weight + acceleration + year + origin.factor, family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.44937  -0.08809   0.00577   0.19315   3.03363  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -19.450793   5.956353  -3.266  0.00109 ** 
cylinders              -0.264169   0.439645  -0.601  0.54793    
displacement            0.015568   0.013658   1.140  0.25434    
horsepower             -0.043081   0.024621  -1.750  0.08017 .  
weight                 -0.005762   0.001376  -4.187 2.83e-05 ***
acceleration            0.012939   0.142921   0.091  0.92786    
year                    0.495635   0.086155   5.753 8.78e-09 ***
origin.factorGerman     1.971277   0.785573   2.509  0.01210 *  
origin.factorJapanese   1.102741   0.713768   1.545  0.12236    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Next, I removed the variables with p-values above the 0.05 significance level, arriving at the following output:

> auto.mpg.logistic <- glm(mpg.binary~ horsepower + weight + year + origin.factor, family="binomial")
> summary(auto.mpg.logistic)

Call:
glm(formula = mpg.binary ~ horsepower + weight + year + origin.factor, 
    family = "binomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2675  -0.0943   0.0080   0.2007   3.2653  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -18.240055   4.912407  -3.713 0.000205 ***
horsepower             -0.042209   0.016441  -2.567 0.010251 *  
weight                 -0.004607   0.000734  -6.276 3.47e-10 ***
year                    0.457663   0.075997   6.022 1.72e-09 ***
origin.factorGerman     1.335225   0.529879   2.520 0.011740 *  
origin.factorJapanese   0.628677   0.580123   1.084 0.278500    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now the only variable still above the 0.05 significance level is origin.factorJapanese.

So the question is: can I somehow remove just origin.factorJapanese and keep origin.factorGerman, since it is significant?

Or is the appropriate action to remove origin.factor entirely, which would eliminate all aspects of this categorical variable from my logistic model (this seems like my only option...)?

I'm new to R and primarily use base R functions, as per our class assignments, so please keep that in mind in your answers. Thanks,

John



Solution 1:[1]

This is really about statistics more than it is about R. You have a model with a bunch of continuous explanatory variables (horsepower, weight, year) and a single factor, origin.factor. The model you are fitting is a parallel-lines model. That is, for each level of origin.factor you are fitting a hyperplane (just think of it as a line if that helps) with a different intercept for each country of origin.

R uses the intercept to fit the base level of your factor, and the coefficients for the remaining levels are really the differences between each level and the base level. Therefore what the regression summary table is telling you is that German cars differ from American cars (American is the base level because it comes first alphabetically, which is how R orders factor levels by default), but Japanese cars do not. Note that it tells you nothing about the difference between German and Japanese cars.
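To make the dummy coding concrete, here is a small base-R sketch (the data are made up; only the variable name matches the post):

```r
## Toy factor mimicking the post's origin.factor
origin.factor <- factor(c(1, 1, 2, 3), labels = c("American", "German", "Japanese"))

## model.matrix() shows the coding glm() uses: an intercept column for the
## base level plus one 0/1 indicator column per remaining level.
model.matrix(~ origin.factor)

## relevel() changes which level is the base, so the summary would then
## report German- and Japanese-relative comparisons against the new base.
origin.releveled <- relevel(origin.factor, ref = "German")
```

Releveling doesn't change the fitted model, only which pairwise comparisons the summary table happens to show.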

So you have some evidence that there are differences between some, but not all, of the levels of the factor. You really don't want to try to fit the model without the Japanese level in there (well, you might, but not for the reasons you think).
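If you want a single test for the factor as a whole, rather than per-level comparisons against the base, the usual base-R route is a likelihood-ratio test. This is a sketch using the model name from the post:

```r
## drop1() refits the model with each term removed in turn; for
## origin.factor it drops both dummy columns together and reports one
## chi-squared p-value for the whole factor.
drop1(auto.mpg.logistic, test = "Chisq")

## Equivalently, compare nested models directly:
reduced <- update(auto.mpg.logistic, . ~ . - origin.factor)
anova(reduced, auto.mpg.logistic, test = "Chisq")
```

Either way, the factor stays or goes as a block; you keep or drop all of its levels together.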

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 James Curran