'Filling a column based on the value of another column in data.table
I have data as follows:
dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A",
"Area B"), class = "factor"), type= c("cat_1", "cat_1",
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1",
"cat_1", NA, "cat_2", NA, NA, "cat_2", "cat_2", "cat_2", "cat_2",
"cat_2")), class = c("data.table", "data.frame"), row.names = c(NA,
-20L))
amount_of_categories municipality type
1: 2 Area A cat_1
2: 2 Area A cat_1
3: 2 Area A cat_1
4: 2 Area A cat_1
5: 2 Area A cat_1
6: 2 Area A cat_1
7: 2 Area A cat_1
8: 2 Area A cat_1
9: 2 Area A cat_1
10: 2 Area A cat_1
11: 2 Area A cat_1
12: 2 Area A <NA>
13: 2 Area A cat_2
14: 1 Area B <NA>
15: 1 Area B <NA>
16: 1 Area B cat_2
17: 1 Area B cat_2
18: 1 Area B cat_2
19: 1 Area B cat_2
20: 1 Area B cat_2
The idea is to create a new column type_estimation
, which replaces the NA
's in the type
column, with the right type. The right type can only be established if there is only one category (amount_of_categories==1
) for that Area
. So it should fill the last two NA
's but not the first.
I tried:
dat <- setDT(dat)[is.na(type) & amount_of_categories==1, type_estimation:= shift(type), by="municipality"]
But this does not work. What is right syntax here?
Desired outcome:
amount_of_categories municipality type type_estimation
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_2
15: 1 Area B <NA> cat_2
16: 1 Area B cat_2 cat_2
17: 1 Area B cat_2 cat_2
18: 1 Area B cat_2 cat_2
19: 1 Area B cat_2 cat_2
20: 1 Area B cat_2 cat_2
EDIT:
I tried to come up with a situation in which the solution provided by Waldi could lead to an issue. After thinking about it for a bit I realised this would be the case if:
dat[,estimation:=zoo::na.locf(type)]
fills with the wrong type, because the last observation is ofArea A
is carried forward, to the first observation ofArea B
andArea B
has only one category, so that[amount_of_categories!=1&is.na(type) ,estimation:=NA][]
does make this valueNA
.
In the example data:
dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A",
"Area B"), class = "factor"), type= c("cat_1", "cat_1",
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1",
"cat_1", NA, "cat_2", NA, NA, "cat_3", "cat_3", "cat_3", "cat_3",
"cat_3")), class = c("data.table", "data.frame"), row.names = c(NA,
-20L))
amount_of_categories municipality type estimation
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_2
15: 1 Area B <NA> cat_2
16: 1 Area B cat_3 cat_3
17: 1 Area B cat_3 cat_3
18: 1 Area B cat_3 cat_3
19: 1 Area B cat_3 cat_3
20: 1 Area B cat_3 cat_3
As Waldi already pointed out, this issue cannot be fixed by using:
dat[,estimation:=zoo::na.locf(type), by="municipality"][amount_of_categories!=1&is.na(type) ,estimation:=NA][]
Any solution for this issue would still be greatly appreciated.
Solution 1:[1]
In two steps:
dat[,estimation:=zoo::na.locf(type)][amount_of_categories!=1&is.na(type) ,estimation:=NA][]
amount_of_categories municipality type estimation
<int> <fctr> <char> <char>
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_2
15: 1 Area B <NA> cat_2
16: 1 Area B cat_2 cat_2
17: 1 Area B cat_2 cat_2
18: 1 Area B cat_2 cat_2
19: 1 Area B cat_2 cat_2
20: 1 Area B cat_2 cat_2
amount_of_categories municipality type estimation
Note that I used zoo::na.locf
because data.table::nafill(type='locf')
doesn't yet work with characters.
An alternative approach by municipality
with na.fill
following you edit (example 2):
dat[,estimation:=zoo::na.fill(type,fill=type[which.max(!is.na(type))]),by=municipality][amount_of_categories!=1&is.na(type) ,estimation:=NA][]
amount_of_categories municipality type estimation
<int> <fctr> <char> <char>
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_3
15: 1 Area B <NA> cat_3
16: 1 Area B cat_3 cat_3
17: 1 Area B cat_3 cat_3
18: 1 Area B cat_3 cat_3
19: 1 Area B cat_3 cat_3
20: 1 Area B cat_3 cat_3
amount_of_categories municipality type estimation
Solution 2:[2]
Does this approach leveraging unique()
and join help with both cases?
unique(
dat[amount_of_categories==1 & !is.na(type), .(municipality,type_estimation=type)]
)[dat, on=.(municipality)][is.na(type_estimation),type_estimation:=type][]
Output for example 1:
municipality type_estimation amount_of_categories type
<fctr> <char> <int> <char>
1: Area A cat_1 2 cat_1
2: Area A cat_1 2 cat_1
3: Area A cat_1 2 cat_1
4: Area A cat_1 2 cat_1
5: Area A cat_1 2 cat_1
6: Area A cat_1 2 cat_1
7: Area A cat_1 2 cat_1
8: Area A cat_1 2 cat_1
9: Area A cat_1 2 cat_1
10: Area A cat_1 2 cat_1
11: Area A cat_1 2 cat_1
12: Area A <NA> 2 <NA>
13: Area A cat_2 2 cat_2
14: Area B cat_2 1 <NA>
15: Area B cat_2 1 <NA>
16: Area B cat_2 1 cat_2
17: Area B cat_2 1 cat_2
18: Area B cat_2 1 cat_2
19: Area B cat_2 1 cat_2
20: Area B cat_2 1 cat_2
Output for example 2:
municipality type_estimation amount_of_categories type
<fctr> <char> <int> <char>
1: Area A cat_1 2 cat_1
2: Area A cat_1 2 cat_1
3: Area A cat_1 2 cat_1
4: Area A cat_1 2 cat_1
5: Area A cat_1 2 cat_1
6: Area A cat_1 2 cat_1
7: Area A cat_1 2 cat_1
8: Area A cat_1 2 cat_1
9: Area A cat_1 2 cat_1
10: Area A cat_1 2 cat_1
11: Area A cat_1 2 cat_1
12: Area A <NA> 2 <NA>
13: Area A cat_2 2 cat_2
14: Area B cat_3 1 <NA>
15: Area B cat_3 1 <NA>
16: Area B cat_3 1 cat_3
17: Area B cat_3 1 cat_3
18: Area B cat_3 1 cat_3
19: Area B cat_3 1 cat_3
20: Area B cat_3 1 cat_3
Solution 3:[3]
An alternative approach is to create a look-up table for the relevant cases which is used in an update join:
library(data.table)
lut <- setDT(dat)[amount_of_categories == 1, first(na.omit(type)), by = municipality]
dat[, estimation := type][lut, on = .(municipality), estimation := V1][]
Result for example 1
amount_of_categories municipality type estimation 1: 2 Area A cat_1 cat_1 2: 2 Area A cat_1 cat_1 3: 2 Area A cat_1 cat_1 4: 2 Area A cat_1 cat_1 5: 2 Area A cat_1 cat_1 6: 2 Area A cat_1 cat_1 7: 2 Area A cat_1 cat_1 8: 2 Area A cat_1 cat_1 9: 2 Area A cat_1 cat_1 10: 2 Area A cat_1 cat_1 11: 2 Area A cat_1 cat_1 12: 2 Area A <NA> <NA> 13: 2 Area A cat_2 cat_2 14: 1 Area B <NA> cat_2 15: 1 Area B <NA> cat_2 16: 1 Area B cat_2 cat_2 17: 1 Area B cat_2 cat_2 18: 1 Area B cat_2 cat_2 19: 1 Area B cat_2 cat_2 20: 1 Area B cat_2 cat_2
Result for example 2
amount_of_categories municipality type estimation 1: 2 Area A cat_1 cat_1 2: 2 Area A cat_1 cat_1 3: 2 Area A cat_1 cat_1 4: 2 Area A cat_1 cat_1 5: 2 Area A cat_1 cat_1 6: 2 Area A cat_1 cat_1 7: 2 Area A cat_1 cat_1 8: 2 Area A cat_1 cat_1 9: 2 Area A cat_1 cat_1 10: 2 Area A cat_1 cat_1 11: 2 Area A cat_1 cat_1 12: 2 Area A <NA> <NA> 13: 2 Area A cat_2 cat_2 14: 1 Area B <NA> cat_3 15: 1 Area B <NA> cat_3 16: 1 Area B cat_3 cat_3 17: 1 Area B cat_3 cat_3 18: 1 Area B cat_3 cat_3 19: 1 Area B cat_3 cat_3 20: 1 Area B cat_3 cat_3
Explanation
- For each
municipality
with only one category the first non-NA element oftype
is picked for the look-up tablelut
. - A new column
estimate
is created indat
as a full copy oftype
. - In the update join all entries in
estimate
are replaced by the value fromlut
only for the matchingmunicipality
.
This approach is somehow similar to langtang's answer but differs in implementation details.
N.B.: Updating column type
directly
The OP has requested to create a separate column estimate
. However, column type can be updated directly which streamlines the code:
dat[lut, on = .(municipality), type := V1][]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | langtang |
Solution 3 |