'How to fill in missing values by group?

I have the following data structure. Within each group, some observations have missing value. I do know that each group has only one non-missing value (10 for group 1 and 11 for group 2 in this case). The location of the missing observations are random within the group (i.e. can't fill in missing values with the previous / following value).

How to fill the missing values with the one non-missing value by group?

group  value
1       .
1       10
1       .
2       11
2       .
2       11

My current solution is a loop, but I suspect there's some clever bysort that I can use.

levelsof group, local(lm_group)
foreach group in `lm_group' {
    levelsof value if group == `group', local(lm_value)
    replace value = `lm_value' if group == `group'
}


Solution 1:[1]

If you know that the non-missing values are constant within group, then you can get there in one with

 bysort group (value) : replace value = value[_n-1] if missing(value) 

as the missing values are first sorted to the end and then each missing value is replaced by the previous non-missing value. Replacement cascades downwards, but only within each group.

For documentation, see this FAQ

To check that there is at most one distinct non-missing value within each group, you could do this:

bysort group (value) : assert (value == value[1]) | missing(value) 

More personal note. It's nice to see levelsof in use, as I first wrote it, but the above is better.

Solution 2:[2]

I think the xfill command is what you are looking for.

To install xfill, copy-paste the following into Stata and follow instructions:

net from http://www.sealedenvelope.com/

After that, the rest is easy:

xfill value, i(group)

You can read up about xfill here

Solution 3:[3]

The clever bysort-answer you were looking for was:

bysort group: egen new_value=max(cond(!missing(value), value, .)

The cond-function checks if the first argument is true and returns value if is and . if it is not.

Solution 4:[4]

FWIW I could not get Nick's bysort solution to work, no clue why. I followed the suggested syntax from the FAQ he linked instead and got it to work, though. The generic form is:

gsort id -myvar
by id: replace myvar = myvar[_n-1] if myvar == .

EDIT: fixed the errant reference to "time" in the previous iteration of this post (and added the if missing condition). The current code should be a functioning generic solution.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nick Cox
Solution 2 Heisenberg
Solution 3 HeckOverflow
Solution 4