'How to fill in missing values by group?
I have the following data structure. Within each group, some observations have missing value
. I do know that each group has only one non-missing value (10
for group 1 and 11
for group 2 in this case). The location of the missing observations are random within the group (i.e. can't fill in missing values with the previous / following value).
How to fill the missing values with the one non-missing value by group?
group value
1 .
1 10
1 .
2 11
2 .
2 11
My current solution is a loop, but I suspect there's some clever bysort
that I can use.
levelsof group, local(lm_group)
foreach group in `lm_group' {
levelsof value if group == `group', local(lm_value)
replace value = `lm_value' if group == `group'
}
Solution 1:[1]
If you know that the non-missing values are constant within group, then you can get there in one with
bysort group (value) : replace value = value[_n-1] if missing(value)
as the missing values are first sorted to the end and then each missing value is replace
d by the previous non-missing value. Replacement cascades downwards, but only within each group.
For documentation, see this FAQ
To check that there is at most one distinct non-missing value within each group, you could do this:
bysort group (value) : assert (value == value[1]) | missing(value)
More personal note. It's nice to see levelsof
in use, as I first wrote it, but the above is better.
Solution 2:[2]
I think the xfill command is what you are looking for.
To install xfill
, copy-paste the following into Stata and follow instructions:
net from http://www.sealedenvelope.com/
After that, the rest is easy:
xfill value, i(group)
You can read up about xfill
here
Solution 3:[3]
The clever bysort-answer you were looking for was:
bysort group: egen new_value=max(cond(!missing(value), value, .)
The cond-function checks if the first argument is true and returns value if is and . if it is not.
Solution 4:[4]
FWIW I could not get Nick's bysort solution to work, no clue why. I followed the suggested syntax from the FAQ he linked instead and got it to work, though. The generic form is:
gsort id -myvar
by id: replace myvar = myvar[_n-1] if myvar == .
EDIT: fixed the errant reference to "time" in the previous iteration of this post (and added the if missing condition). The current code should be a functioning generic solution.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Nick Cox |
Solution 2 | Heisenberg |
Solution 3 | HeckOverflow |
Solution 4 |