PLM: Cannot add dummy variable
I am currently working on estimating a fixed-effects model with plm(). The table below shows an example of my data (note that the numbers are arbitrary). I ran the regression with district and year fixed effects and, as expected, got an error due to duplicate id-time pairs. I therefore merged district and grade together to obtain a unique id for the regression (see the sketch after the table for one way to construct such an id).
state | district | year | grade | Y | X | id |
---|---|---|---|---|---|---|
AK | 1001 | 2009 | 3 | 0.1 | 0.5 | 1001.3 |
AK | 1001 | 2010 | 3 | 0.8 | 0.4 | 1001.3 |
AK | 1001 | 2011 | 3 | 0.5 | 0.7 | 1001.3 |
AK | 1001 | 2009 | 4 | 1.5 | 1.3 | 1001.4 |
AK | 1001 | 2010 | 4 | 1.1 | 0.7 | 1001.4 |
AK | 1001 | 2011 | 4 | 2.1 | 0.4 | 1001.4 |
... | ... | ... | .. | .. | .. | ... |
WY | 5606 | 2011 | 6 | 4.2 | 5.3 | 5606.6 |
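For reference, a combined id like the one in the last column can be constructed directly from district and grade; the snippet below is only a sketch (it assumes a data frame df with the columns shown above, and the choice of separator is arbitrary):
library(dplyr)

# One possible way to build a unique unit identifier from district and grade
# (the id column in the question may have been constructed differently).
df <- df %>%
  mutate(id = paste(district, grade, sep = "."))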
Everything went fine until I tried to add grade-level dummy variables to the regression. I tried both factor(grade) and manually created dummy variables in the equation, but neither worked: the dummy variables did not show up in my results. For conciseness I only show the first attempt with factor(). In the second regression I generated grade-level dummy variables, i.e. g3 and g4, and put them in the regression instead of factor(grade), so it looked like plm(formula = Y ~ X + g3 + g4, ...) (a sketch of that setup follows).
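For concreteness, the second attempt could be set up along these lines (a sketch only: g3 and g4 are the hand-made dummies mentioned above, and df is the data frame from the table):
library(plm)

# Hand-made grade dummies used instead of factor(grade) -- illustrative only.
df$g3 <- as.integer(df$grade == 3)
df$g4 <- as.integer(df$grade == 4)

fe2 <- plm(formula = Y ~ X + g3 + g4,
           data = df,
           index = c("id", "year"),
           model = "within",
           effect = "twoways")
summary(fe2)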
fe <- plm(formula = Y ~ X + factor(grade),
          data = df,
          index = c("id", "year"),
          model = "within",
          effect = "twoways")
summary(fe)
Twoways effects Within Model
Call:
plm(formula = Y ~ X + factor(grade), data = df,
effect = "twoways", model = "within", index = c("id",
"year"))
Unbalanced Panel: n = 64302, T = 1-10, N = 499112
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-11.35455 -0.34340 0.00000 0.34364 6.42513
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Y 0.0126717 0.0036019 3.518 0.0004348 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 173290
Residual Sum of Squares: 173280
R-Squared: 2.8464e-05
Adj. R-Squared: -0.14788
F-statistic: 12.3766 on 1 and 434800 DF, p-value: 0.00043478
Question: Why did this happen? Was it because I combined district and grade into a single id? If so, how should I fix it to get the coefficients of these dummy variables? Is plm() an appropriate package to use here? Any suggestions would be appreciated. Thank you!
P.S. It is definitely not a problem of multicollinearity. This post said the issue was related to that problem; I followed the post, but the check returned FALSE for my results.
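One way to probe the multicollinearity question directly is to check whether grade varies within id at all; the snippet below is a sketch, assuming the df from the question:
library(dplyr)

# Count how many distinct grades each id contains.
df %>%
  group_by(id) %>%
  summarise(n_grades = n_distinct(grade)) %>%
  count(n_grades)
# If every id has n_grades == 1, grade is constant within id and therefore
# perfectly collinear with the unit (id) fixed effects.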
Solution 1:[1]
Despite the test you point to, this is definitely a collinearity problem. There is no independent information in grade that is not already accounted for by id. Here is a simple example. In the model below, the only variable is the id factor, so the model essentially estimates the mean of y for each value of id: the intercept plus the coefficient on the dummy variable for that particular id.
library(dplyr)    # tibble(), %>%, group_by(), summarise(), mutate(), ...
library(stringr)  # str_detect()

set.seed(123)
dat <- tibble(
  dist = sample(LETTERS[1:10], 1000, replace = TRUE),
  grade = sample(letters[17:26], 1000, replace = TRUE),
  id = paste(dist, grade, sep = "-"),
  y = rnorm(1000)
)
mod <- lm(y ~ factor(id), data = dat)
Now, suppose we want to use the model to recover the mean of y for any particular grade. Here are the grade means of y:
dat %>%
  group_by(grade) %>%
  summarise(m = mean(y))
# # A tibble: 10 × 2
# grade m
# <chr> <dbl>
# 1 q -0.0523
# 2 r -0.193
# 3 s -0.0964
# 4 t 0.0647
# 5 u -0.161
# 6 v -0.0273
# 7 w 0.0390
# 8 x 0.109
# 9 y -0.104
# 10 z 0.146
Let's try to use the model estimates to recover the grade mean of y for grade = z. First, let's find what share of the grade = z observations falls in each id group containing grade = z:
n <- dat %>%
  group_by(id) %>%
  tally() %>%
  filter(str_detect(id, "z$")) %>%
  mutate(pct = n/sum(n))
n
# # A tibble: 10 × 3
# id n pct
# <chr> <int> <dbl>
# 1 A-z 11 0.112
# 2 B-z 10 0.102
# 3 C-z 13 0.133
# 4 D-z 6 0.0612
# 5 E-z 12 0.122
# 6 F-z 8 0.0816
# 7 G-z 9 0.0918
# 8 H-z 10 0.102
# 9 I-z 7 0.0714
# 10 J-z 12 0.122
We can now collect the intercept and the coefficients for the id values that contain grade = z:
ests <- broom::tidy(mod) %>%
  filter(str_detect(term, "ntercept|z")) %>%
  mutate(term = gsub("factor\\(id\\)", "", term)) %>%
  select(1, 2)
ests
# # A tibble: 11 × 2
# term estimate
# <chr> <dbl>
# 1 (Intercept) -0.391
# 2 A-z 0.264
# 3 B-z 0.572
# 4 C-z 0.520
# 5 D-z 0.863
# 6 E-z 0.774
# 7 F-z 0.755
# 8 G-z 0.591
# 9 H-z 0.0538
# 10 I-z -0.0657
# 11 J-z 0.951
We can then join these data to the percentages from above and set the percentage for the intercept term to 1, because we want to add the intercept to the weighted average of the group coefficients:
ests <- ests %>%
  left_join(n %>% rename(term = id)) %>%
  mutate(pct = ifelse(is.na(pct), 1, pct))
ests
# # A tibble: 11 × 4
# term estimate n pct
# <chr> <dbl> <int> <dbl>
# 1 (Intercept) -0.391 NA 1
# 2 A-z 0.264 11 0.112
# 3 B-z 0.572 10 0.102
# 4 C-z 0.520 13 0.133
# 5 D-z 0.863 6 0.0612
# 6 E-z 0.774 12 0.122
# 7 F-z 0.755 8 0.0816
# 8 G-z 0.591 9 0.0918
# 9 H-z 0.0538 10 0.102
# 10 I-z -0.0657 7 0.0714
# 11 J-z 0.951 12 0.122
Finally, we can just sum the estimate column multiplied by the pct column:
ests %>%
  summarise(m = sum(pct * estimate))
# # A tibble: 1 × 1
# m
# <dbl>
# 1 0.146
Note that this is exactly the value we calculated for the grade = z mean above. This suggests that we can perfectly recover the grade means of y from the id coefficients, which means we cannot estimate an independent effect of grade once id is already accounted for, because of perfect collinearity.
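The same point can be checked directly by adding factor(grade) on top of factor(id) in lm(): the grade dummies are exact linear combinations of the id dummies, so their coefficients come back as NA (a quick check using the simulated dat above):
# grade adds no information beyond id: its dummies are aliased.
mod2 <- lm(y ~ factor(id) + factor(grade), data = dat)
tail(coef(mod2), 9)   # the nine factor(grade) coefficients are all NA
alias(mod2)           # lists the exact linear dependencies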
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | DaveArmstrong |