'Filling a column based on the value of another column in data.table

I have data as follows:

dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A", 
"Area B"), class = "factor"), type= c("cat_1", "cat_1", 
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", 
"cat_1", NA, "cat_2", NA, NA, "cat_2", "cat_2", "cat_2", "cat_2", 
"cat_2")), class = c("data.table", "data.frame"), row.names = c(NA, 
-20L))

    amount_of_categories municipality  type
 1:                    2       Area A cat_1
 2:                    2       Area A cat_1
 3:                    2       Area A cat_1
 4:                    2       Area A cat_1
 5:                    2       Area A cat_1
 6:                    2       Area A cat_1
 7:                    2       Area A cat_1
 8:                    2       Area A cat_1
 9:                    2       Area A cat_1
10:                    2       Area A cat_1
11:                    2       Area A cat_1
12:                    2       Area A  <NA>
13:                    2       Area A cat_2
14:                    1       Area B  <NA>
15:                    1       Area B  <NA>
16:                    1       Area B cat_2
17:                    1       Area B cat_2
18:                    1       Area B cat_2
19:                    1       Area B cat_2
20:                    1       Area B cat_2

The idea is to create a new column type_estimation, which replaces the NA's in the type column, with the right type. The right type can only be established if there is only one category (amount_of_categories==1) for that Area. So it should fill the last two NA's but not the first.

I tried:

dat <- setDT(dat)[is.na(type) & amount_of_categories==1, type_estimation:= shift(type), by="municipality"]

But this does not work. What is right syntax here?

Desired outcome:

    amount_of_categories municipality  type  type_estimation
 1:                    2       Area A cat_1            cat_1
 2:                    2       Area A cat_1            cat_1
 3:                    2       Area A cat_1            cat_1
 4:                    2       Area A cat_1            cat_1
 5:                    2       Area A cat_1            cat_1
 6:                    2       Area A cat_1            cat_1
 7:                    2       Area A cat_1            cat_1
 8:                    2       Area A cat_1            cat_1
 9:                    2       Area A cat_1            cat_1
10:                    2       Area A cat_1            cat_1
11:                    2       Area A cat_1            cat_1
12:                    2       Area A  <NA>             <NA> 
13:                    2       Area A cat_2            cat_2
14:                    1       Area B  <NA>            cat_2
15:                    1       Area B  <NA>            cat_2
16:                    1       Area B cat_2            cat_2
17:                    1       Area B cat_2            cat_2
18:                    1       Area B cat_2            cat_2
19:                    1       Area B cat_2            cat_2
20:                    1       Area B cat_2            cat_2

EDIT:

I tried to come up with a situation in which the solution provided by Waldi could lead to an issue. After thinking about it for a bit I realised this would be the case if:

  1. dat[,estimation:=zoo::na.locf(type)] fills with the wrong type, because the last observation is of Area A is carried forward, to the first observation of Area B and
  2. Area B has only one category, so that [amount_of_categories!=1&is.na(type) ,estimation:=NA][] does make this value NA.

In the example data:

dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A", 
"Area B"), class = "factor"), type= c("cat_1", "cat_1", 
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", 
"cat_1", NA, "cat_2", NA, NA, "cat_3", "cat_3", "cat_3", "cat_3", 
"cat_3")), class = c("data.table", "data.frame"), row.names = c(NA, 
-20L))

   amount_of_categories municipality  type estimation
 1:                    2       Area A cat_1      cat_1
 2:                    2       Area A cat_1      cat_1
 3:                    2       Area A cat_1      cat_1
 4:                    2       Area A cat_1      cat_1
 5:                    2       Area A cat_1      cat_1
 6:                    2       Area A cat_1      cat_1
 7:                    2       Area A cat_1      cat_1
 8:                    2       Area A cat_1      cat_1
 9:                    2       Area A cat_1      cat_1
10:                    2       Area A cat_1      cat_1
11:                    2       Area A cat_1      cat_1
12:                    2       Area A  <NA>       <NA>
13:                    2       Area A cat_2      cat_2
14:                    1       Area B  <NA>      cat_2
15:                    1       Area B  <NA>      cat_2
16:                    1       Area B cat_3      cat_3
17:                    1       Area B cat_3      cat_3
18:                    1       Area B cat_3      cat_3
19:                    1       Area B cat_3      cat_3
20:                    1       Area B cat_3      cat_3

As Waldi already pointed out, this issue cannot be fixed by using:

dat[,estimation:=zoo::na.locf(type), by="municipality"][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

Any solution for this issue would still be greatly appreciated.



Solution 1:[1]

In two steps:

dat[,estimation:=zoo::na.locf(type)][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

    amount_of_categories municipality   type estimation
                   <int>       <fctr> <char>     <char>
 1:                    2       Area A  cat_1      cat_1
 2:                    2       Area A  cat_1      cat_1
 3:                    2       Area A  cat_1      cat_1
 4:                    2       Area A  cat_1      cat_1
 5:                    2       Area A  cat_1      cat_1
 6:                    2       Area A  cat_1      cat_1
 7:                    2       Area A  cat_1      cat_1
 8:                    2       Area A  cat_1      cat_1
 9:                    2       Area A  cat_1      cat_1
10:                    2       Area A  cat_1      cat_1
11:                    2       Area A  cat_1      cat_1
12:                    2       Area A   <NA>       <NA>
13:                    2       Area A  cat_2      cat_2
14:                    1       Area B   <NA>      cat_2
15:                    1       Area B   <NA>      cat_2
16:                    1       Area B  cat_2      cat_2
17:                    1       Area B  cat_2      cat_2
18:                    1       Area B  cat_2      cat_2
19:                    1       Area B  cat_2      cat_2
20:                    1       Area B  cat_2      cat_2
    amount_of_categories municipality   type estimation

Note that I used zoo::na.locf because data.table::nafill(type='locf') doesn't yet work with characters.

An alternative approach by municipality with na.fill following you edit (example 2):

dat[,estimation:=zoo::na.fill(type,fill=type[which.max(!is.na(type))]),by=municipality][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

   amount_of_categories municipality   type estimation
                   <int>       <fctr> <char>     <char>
 1:                    2       Area A  cat_1      cat_1
 2:                    2       Area A  cat_1      cat_1
 3:                    2       Area A  cat_1      cat_1
 4:                    2       Area A  cat_1      cat_1
 5:                    2       Area A  cat_1      cat_1
 6:                    2       Area A  cat_1      cat_1
 7:                    2       Area A  cat_1      cat_1
 8:                    2       Area A  cat_1      cat_1
 9:                    2       Area A  cat_1      cat_1
10:                    2       Area A  cat_1      cat_1
11:                    2       Area A  cat_1      cat_1
12:                    2       Area A   <NA>       <NA>
13:                    2       Area A  cat_2      cat_2
14:                    1       Area B   <NA>      cat_3
15:                    1       Area B   <NA>      cat_3
16:                    1       Area B  cat_3      cat_3
17:                    1       Area B  cat_3      cat_3
18:                    1       Area B  cat_3      cat_3
19:                    1       Area B  cat_3      cat_3
20:                    1       Area B  cat_3      cat_3
    amount_of_categories municipality   type estimation

Solution 2:[2]

Does this approach leveraging unique() and join help with both cases?

unique(
  dat[amount_of_categories==1 & !is.na(type), .(municipality,type_estimation=type)]
)[dat, on=.(municipality)][is.na(type_estimation),type_estimation:=type][]

Output for example 1:

    municipality type_estimation amount_of_categories   type
          <fctr>          <char>                <int> <char>
 1:       Area A           cat_1                    2  cat_1
 2:       Area A           cat_1                    2  cat_1
 3:       Area A           cat_1                    2  cat_1
 4:       Area A           cat_1                    2  cat_1
 5:       Area A           cat_1                    2  cat_1
 6:       Area A           cat_1                    2  cat_1
 7:       Area A           cat_1                    2  cat_1
 8:       Area A           cat_1                    2  cat_1
 9:       Area A           cat_1                    2  cat_1
10:       Area A           cat_1                    2  cat_1
11:       Area A           cat_1                    2  cat_1
12:       Area A            <NA>                    2   <NA>
13:       Area A           cat_2                    2  cat_2
14:       Area B           cat_2                    1   <NA>
15:       Area B           cat_2                    1   <NA>
16:       Area B           cat_2                    1  cat_2
17:       Area B           cat_2                    1  cat_2
18:       Area B           cat_2                    1  cat_2
19:       Area B           cat_2                    1  cat_2
20:       Area B           cat_2                    1  cat_2

Output for example 2:

    municipality type_estimation amount_of_categories   type
          <fctr>          <char>                <int> <char>
 1:       Area A           cat_1                    2  cat_1
 2:       Area A           cat_1                    2  cat_1
 3:       Area A           cat_1                    2  cat_1
 4:       Area A           cat_1                    2  cat_1
 5:       Area A           cat_1                    2  cat_1
 6:       Area A           cat_1                    2  cat_1
 7:       Area A           cat_1                    2  cat_1
 8:       Area A           cat_1                    2  cat_1
 9:       Area A           cat_1                    2  cat_1
10:       Area A           cat_1                    2  cat_1
11:       Area A           cat_1                    2  cat_1
12:       Area A            <NA>                    2   <NA>
13:       Area A           cat_2                    2  cat_2
14:       Area B           cat_3                    1   <NA>
15:       Area B           cat_3                    1   <NA>
16:       Area B           cat_3                    1  cat_3
17:       Area B           cat_3                    1  cat_3
18:       Area B           cat_3                    1  cat_3
19:       Area B           cat_3                    1  cat_3
20:       Area B           cat_3                    1  cat_3

Solution 3:[3]

An alternative approach is to create a look-up table for the relevant cases which is used in an update join:

library(data.table)
lut <- setDT(dat)[amount_of_categories == 1, first(na.omit(type)), by = municipality]
dat[, estimation := type][lut, on = .(municipality), estimation := V1][]

Result for example 1

    amount_of_categories municipality  type estimation
 1:                    2       Area A cat_1      cat_1
 2:                    2       Area A cat_1      cat_1
 3:                    2       Area A cat_1      cat_1
 4:                    2       Area A cat_1      cat_1
 5:                    2       Area A cat_1      cat_1
 6:                    2       Area A cat_1      cat_1
 7:                    2       Area A cat_1      cat_1
 8:                    2       Area A cat_1      cat_1
 9:                    2       Area A cat_1      cat_1
10:                    2       Area A cat_1      cat_1
11:                    2       Area A cat_1      cat_1
12:                    2       Area A  <NA>       <NA>
13:                    2       Area A cat_2      cat_2
14:                    1       Area B  <NA>      cat_2
15:                    1       Area B  <NA>      cat_2
16:                    1       Area B cat_2      cat_2
17:                    1       Area B cat_2      cat_2
18:                    1       Area B cat_2      cat_2
19:                    1       Area B cat_2      cat_2
20:                    1       Area B cat_2      cat_2

Result for example 2

    amount_of_categories municipality  type estimation
 1:                    2       Area A cat_1      cat_1
 2:                    2       Area A cat_1      cat_1
 3:                    2       Area A cat_1      cat_1
 4:                    2       Area A cat_1      cat_1
 5:                    2       Area A cat_1      cat_1
 6:                    2       Area A cat_1      cat_1
 7:                    2       Area A cat_1      cat_1
 8:                    2       Area A cat_1      cat_1
 9:                    2       Area A cat_1      cat_1
10:                    2       Area A cat_1      cat_1
11:                    2       Area A cat_1      cat_1
12:                    2       Area A  <NA>       <NA>
13:                    2       Area A cat_2      cat_2
14:                    1       Area B  <NA>      cat_3
15:                    1       Area B  <NA>      cat_3
16:                    1       Area B cat_3      cat_3
17:                    1       Area B cat_3      cat_3
18:                    1       Area B cat_3      cat_3
19:                    1       Area B cat_3      cat_3
20:                    1       Area B cat_3      cat_3

Explanation

  1. For each municipality with only one category the first non-NA element of type is picked for the look-up table lut.
  2. A new column estimate is created in dat as a full copy of type.
  3. In the update join all entries in estimate are replaced by the value from lut only for the matching municipality.

This approach is somehow similar to langtang's answer but differs in implementation details.

N.B.: Updating column type directly

The OP has requested to create a separate column estimate. However, column type can be updated directly which streamlines the code:

dat[lut, on = .(municipality), type := V1][]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 langtang
Solution 3