PLM: Cannot add dummy variable

I am currently estimating a fixed-effects model with plm(). The table below is an example of my data (the numbers are arbitrary). When I ran the regression with district and year fixed effects, I got an error, as expected, because the district-year pairs are duplicated across grades. I therefore combined district and grade to obtain a unique id for the regression.

state district year grade Y X id
AK 1001 2009 3 0.1 0.5 1001.3
AK 1001 2010 3 0.8 0.4 1001.3
AK 1001 2011 3 0.5 0.7 1001.3
AK 1001 2009 4 1.5 1.3 1001.4
AK 1001 2010 4 1.1 0.7 1001.4
AK 1001 2011 4 2.1 0.4 1001.4
... ... ... .. .. .. ...
WY 5606 2011 6 4.2 5.3 5606.6
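
For reference, a combined id like the one in the table can be built by pasting the two columns together (a sketch; it assumes the data frame is named df with the district and grade columns shown above):

dfsketch <- data.frame(district = c(1001, 1001), grade = c(3, 4))
# paste district and grade into a single identifier, e.g. "1001.3", "1001.4"
dfsketch$id <- paste(dfsketch$district, dfsketch$grade, sep = ".")

Keeping id as a character/factor rather than converting it back to numeric avoids any ambiguity between, say, grade 3 and grade 30 in a decimal suffix.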

Everything went well until I tried to add grade-level dummy variables to the regression. I tried both factor() and manually created dummy variables, but neither worked: the dummy variables did not appear in my results. For conciseness I show only the first attempt, with factor(). In the second regression I generated grade-level dummy variables, e.g., g3 and g4, and put them in the regression instead of factor(grade), i.e., plm(formula = Y ~ X + g3 + g4, ....

fe <- plm(formula = Y ~ X + factor(grade),
          data = df,
          index = c("id", "year"),
          model = "within",
          effect = "twoways")
summary(fe)

Twoways effects Within Model
Call:
plm(formula = Y ~ X + factor(grade), data = df, 
effect = "twoways", model = "within", index = c("id", 
    "year"))
Unbalanced Panel: n = 64302, T = 1-10, N = 499112
Residuals:
Min.   1st Qu.    Median   3rd Qu.      Max. 
-11.35455  -0.34340   0.00000   0.34364   6.42513 
Coefficients:
   Estimate Std. Error t-value  Pr(>|t|)    
X 0.0126717  0.0036019   3.518 0.0004348 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares:    173290
Residual Sum of Squares: 173280
R-Squared:      2.8464e-05
Adj. R-Squared: -0.14788
F-statistic: 12.3766 on 1 and 434800 DF, p-value: 0.00043478

Question: Why did this happen? Was it because I combined district and grade into id? If so, how should I fix it to get the coefficients of these dummy variables? Is plm() an appropriate package for this? Any suggestions would be appreciated. Thank you!

P.S. It is definitely not a problem of multicollinearity. A post I found said this symptom was related to multicollinearity, but when I followed the test from that post, it returned FALSE for my data.



Solution 1:[1]

Despite the test you point to, this is definitely a collinearity problem: there is no independent information in grade that is not already accounted for by id. Here's a simple example. In this model the only variable is the id factor, which essentially estimates the mean of y for each value of id as the intercept plus the coefficient on that id's dummy variable.

library(tidyverse)  # tibble, dplyr, and stringr are used below

set.seed(123)
dat <- tibble(
  dist = sample(LETTERS[1:10], 1000, replace=TRUE), 
  grade = sample(letters[17:26], 1000, replace=TRUE), 
  id = paste(dist, grade, sep="-"), 
  y = rnorm(1000)
)

mod <- lm(y ~ factor(id), data=dat)
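
To see the aliasing directly (a quick check, not part of the original walkthrough): if you put factor(grade) alongside factor(id) in an lm() on the dat created above, R drops the grade dummies as aliased, because each one is a linear combination of the id dummies.

mod2 <- lm(y ~ factor(id) + factor(grade), data = dat)
# the grade coefficients come back NA (aliased):
coef(mod2)[grepl("grade", names(coef(mod2)))]
# alias() lists the exact linear dependencies among the columns:
alias(mod2)

plm() handles the same situation by silently dropping the collinear terms, which is why factor(grade) never shows up in the summary.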

Now, suppose we want to use the model to recover the mean of y for any grade. Here are the grade means of y computed directly:

dat %>% 
  group_by(grade) %>% 
  summarise(m = mean(y))
# # A tibble: 10 × 2
#  grade       m
#  <chr>   <dbl>
# 1 q     -0.0523
# 2 r     -0.193 
# 3 s     -0.0964
# 4 t      0.0647
# 5 u     -0.161 
# 6 v     -0.0273
# 7 w      0.0390
# 8 x      0.109 
# 9 y     -0.104 
# 10 z      0.146 

Let's try to use the model estimates to get the mean of y for grade=z. First, let's find what share of the grade=z observations falls in each id group:

n <- dat %>% 
  group_by(id) %>% 
  tally() %>% 
  filter(str_detect(id, "z$")) %>% 
  mutate(pct = n/sum(n))
n
# # A tibble: 10 × 3
#  id        n    pct
#  <chr> <int>  <dbl>
# 1 A-z      11 0.112 
# 2 B-z      10 0.102 
# 3 C-z      13 0.133 
# 4 D-z       6 0.0612
# 5 E-z      12 0.122 
# 6 F-z       8 0.0816
# 7 G-z       9 0.0918
# 8 H-z      10 0.102 
# 9 I-z       7 0.0714
# 10 J-z     12 0.122 

We can now collect the intercept and the coefficients for the id values that contain grade=z:

ests <- broom::tidy(mod) %>% 
  filter(str_detect(term, "ntercept|z")) %>% 
  mutate(term = gsub("factor\\(id\\)", "", term)) %>% 
  select(1,2) 
ests
# # A tibble: 11 × 2
#   term        estimate
#   <chr>          <dbl>
# 1 (Intercept)  -0.391 
# 2 A-z           0.264 
# 3 B-z           0.572 
# 4 C-z           0.520 
# 5 D-z           0.863 
# 6 E-z           0.774 
# 7 F-z           0.755 
# 8 G-z           0.591 
# 9 H-z           0.0538
# 10 I-z          -0.0657
# 11 J-z           0.951 

We can then join these estimates to the percentages from above and set the percentage for the intercept term to 1, because we want to add the full intercept to the weighted average of the group coefficients:

ests <- ests %>% 
  left_join(n %>% rename(term = id)) %>% 
  mutate(pct = ifelse(is.na(pct), 1, pct)) 
ests
# # A tibble: 11 × 4
#   term        estimate     n    pct
#   <chr>          <dbl> <int>  <dbl>
# 1 (Intercept)  -0.391     NA 1     
# 2 A-z           0.264     11 0.112 
# 3 B-z           0.572     10 0.102 
# 4 C-z           0.520     13 0.133 
# 5 D-z           0.863      6 0.0612
# 6 E-z           0.774     12 0.122 
# 7 F-z           0.755      8 0.0816
# 8 G-z           0.591      9 0.0918
# 9 H-z           0.0538    10 0.102 
# 10 I-z          -0.0657     7 0.0714
# 11 J-z           0.951     12 0.122 

Finally, we can just sum the estimate column multiplied by the pct column:

ests %>% 
  summarise(m = sum(pct*estimate))
# # A tibble: 1 × 1
#  m
#  <dbl>
# 1 0.146
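
As a programmatic check (a small addition to the walkthrough), the weighted sum of coefficients reproduces the direct group mean up to floating-point tolerance:

all.equal(sum(ests$pct * ests$estimate),
          mean(dat$y[dat$grade == "z"]))
# should return TRUE

This holds exactly in the algebra: a regression on a saturated factor fits each group mean perfectly, so the intercept plus each id coefficient is that id's mean of y, and their weighted average is the grade=z mean.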

Note that this is exactly the value we calculated for the grade=z mean above. We can perfectly recover the grade means of y from the id coefficients alone, which means we cannot estimate an independent effect of grade once id is accounted for: the two are perfectly collinear.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DaveArmstrong