'replacing missing values in r
I need help in replacing missing values in the following dummy file. The following rule need to be followed when replacing a missing value.
If the value is the same on both sides of the column where the cell has a missing value, the missing value should be replaced with the value on either side.
If the value is the same on both sides of a column where two adjacent cells have a missing value, the missing value should be replaced with the value on either side.
If the value same on both sides of the column where 3, 4 or more adjacent cells have missing value, the missing value should be replaced with the value on either side of it
If the value in 2007 Colum missing, then it should be replaced with the value of 2008 and 2009 if they are the same
If a value in the 2017 Column is missing, it should be replaced with the values from 2016 and 2015, if they are the same.
If the value is not the same on both sides of the column containing the missing value, the missing value should be replaced with the most frequently occurring value between 2007 and 2017 columns.
If 2007 and 2008 missing, replace both missing value with 2009 if 2009==2010==2011
If 2007, 2008 and 2009 missing, replace all three-missing value with 2010 if 2010==2011==2012
If 2007, 2008, 2009 and 2010 missing, replace all four-missing value with 2011 if 2011==2012==2013
If 2017 and 2016 missing, replace both missing value with 2015 if 2015==2014==2013
If 2017, 2016 and 2015 missing, replace all three-missing value with 2014 if 2014==2013==2012
If 2017, 2016, 2015 and 2014 missing, replace all four-missing value with 2013 if 2013==2012==2011
create new variable of count of unique value during 2007 and 2017 for every case
dummy data is below
dput(gb)
structure(list(ID = 1:20, X2007 = c("a1", "v1", "", "e1", "d1",
"g1", "t1", "b2w", "p1", "q1", "sd1", "fr4", "fr6", "gt7", "",
"ju8", "ki9", "lo9", "", "i88"), X2008 = c("a1", "v1", "c1",
"e1", "d1", "", "t1", "b2w", "", "", "", "", "", "", "", "",
"", "", "", ""), X2009 = c("a1", "", "c1", "", "", "d1", "t1",
"", "p1", "", "sd1", "", "fr6", "", "hj7", "ju8", "ki9", "lo9",
"k99", "i88"), X2010 = c("a1", "", "", "e1", "", "d1", "", "",
"p1", "", "sd1", "", "fr6", "gt7", "hj7", "", "ki9", "", "k99",
""), X2011 = c("", "v1", "", "", "", "d1", "", "b2w", "p1", "q1",
"sd1", "", "fr6", "gt7", "hj7", "", "ki9", "", "k99", ""), X2012 = c("a1",
"v1", "c1", "e1", "", "", "", "b2w", "p1", "q1", "sd1", "", "fr6",
"gt7", "hj7", "ju8", "ki9", "lo9", "k99", ""), X2013 = c("b1 ",
"", "c1", "e1", "d1", "", "t1", "", "p1", "q1", "sd1", "fr4",
"fr6", "gt7", "hj7", "ju8", "ki9", "lo9", "k99", ""), X2014 = c("",
"v1", "", "", "d1", "g1", "t1", "", "", "q1", "", "fr4", "",
"gt7", "", "ju8", "", "lo9", "", "i88"), X2015 = c("b3", "b6",
"", "", "d1", "g1", "t1", "", "", "q1", "", "fr4", "", "", "",
"ju8", "", "lo9", "", "i88"), X2016 = c("b4", "b6", "", "", "d1",
"g1", "t1", "b2w", "", "", "", "fr4", "", "", "", "", "", "lo9",
"", "i88"), X2017 = c("b5", "b6", "c1", "e1", "d1", "g1", "",
"", "", "", "", "fr4", "", "", "", "", "", "lo9", "", "i88")), class = "data.frame", row.names = c(NA,
-20L))
Solution 1:[1]
Here is a possible approach:
- Pivot longer
- change "" to NA, and trim white space
- group by ID
- Within each group:
- use
zoo::locf
, get the "adjacent" values for any sequence of consecutive missing - get the most common value (I create a little helper function for this)
- apply the rules using
case_when()
- Pivot back to wide
most_common <- function(v) {
tv = table(v[!is.na(v)])
names(tv)[which.max(tv)]
}
gb %>%
pivot_longer(-ID,names_prefix = "X",names_transform = as.integer, names_to="year") %>%
mutate(value=if_else(value=="", as.character(NA), trimws(value))) %>%
group_by(ID) %>%
mutate(
prevv=zoo::na.locf(value, na.rm=F),
nextv=zoo::na.locf(value, na.rm=F, fromLast=T),
mostf = most_common(value),
# Now replace the NA
nvalue = case_when(
!is.na(value)~value,
prevv==nextv~prevv,
is.na(value) & year==2007 & lead(value,1) == lead(value,2)~lead(value,1),
is.na(value) & year==2017 & lag(value,1) == lag(value,2)~lag(value,1),
TRUE~mostf
)) %>%
pivot_wider(ID, names_from=year,names_prefix = "X", values_from=nvalue)
Output:
ID X2007 X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015 X2016 X2017
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 a1 a1 a1 a1 a1 a1 b1 a1 b3 b4 b5
2 2 v1 v1 v1 v1 v1 v1 v1 v1 b6 b6 b6
3 3 c1 c1 c1 c1 c1 c1 c1 c1 c1 c1 c1
4 4 e1 e1 e1 e1 e1 e1 e1 e1 e1 e1 e1
5 5 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1
6 6 g1 g1 d1 d1 d1 g1 g1 g1 g1 g1 g1
7 7 t1 t1 t1 t1 t1 t1 t1 t1 t1 t1 t1
8 8 b2w b2w b2w b2w b2w b2w b2w b2w b2w b2w b2w
9 9 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1 p1
10 10 q1 q1 q1 q1 q1 q1 q1 q1 q1 q1 q1
11 11 sd1 sd1 sd1 sd1 sd1 sd1 sd1 sd1 sd1 sd1 sd1
12 12 fr4 fr4 fr4 fr4 fr4 fr4 fr4 fr4 fr4 fr4 fr4
13 13 fr6 fr6 fr6 fr6 fr6 fr6 fr6 fr6 fr6 fr6 fr6
14 14 gt7 gt7 gt7 gt7 gt7 gt7 gt7 gt7 gt7 gt7 gt7
15 15 hj7 hj7 hj7 hj7 hj7 hj7 hj7 hj7 hj7 hj7 hj7
16 16 ju8 ju8 ju8 ju8 ju8 ju8 ju8 ju8 ju8 ju8 ju8
17 17 ki9 ki9 ki9 ki9 ki9 ki9 ki9 ki9 ki9 ki9 ki9
18 18 lo9 lo9 lo9 lo9 lo9 lo9 lo9 lo9 lo9 lo9 lo9
19 19 k99 k99 k99 k99 k99 k99 k99 k99 k99 k99 k99
20 20 i88 i88 i88 i88 i88 i88 i88 i88 i88 i88 i88
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | langtang |