Use 'start' and 'end' values in two columns to specify fill range over remaining columns in R

I need to fill each row of a matrix with '1' between 'start' and 'end' columns, where the 'start' and 'end' column names (dates in the real data) are specified for each 'id' in two columns of the matrix.

e.g.

library(data.table)

d<- data.table(id = c("id_1","id_2"), 
               start.date = c(as.Date("2021-06-01"), as.Date("2021-07-02")), 
               end.date = c(as.Date("2021-08-04"), as.Date("2021-09-12")))

> d
     id start.date   end.date
1: id_1 2021-06-01 2021-08-04
2: id_2 2021-07-02 2021-09-12

The goal is to get a count of the number of individuals that fall on each date. With a smaller dataset I would do this:


expand.dates<- function(start.date, end.date){
  ## return the daily sequence directly instead of via an invisible assignment
  seq.Date(start.date, end.date, by = "1 day")
}



##join expanded dates list to the original data.table 'd' on 'id'
xx<- d[d[,.(dates = expand.dates(start.date, end.date)), by = id], on = .(id)]

cnts<- xx[,.(counts = .N), by = .(dates)]

But the real data has several million individual IDs and the above approach leads to a memory error (cannot create vector of 8.5GB), so I am trying to 'cast' the date ranges and then run colSums across the dates to get the counts.



Solution 1:[1]

Answer to edited question

The OP has edited the question and clarified the intent:

imagine several million distinct IDs and a full range of possible start and end dates, spanning anywhere from a few days to a few years. The goal is to get a count of individuals that fall on each date

I have solved a similar problem with the help of the IRanges package from Bioconductor:

install.packages("IRanges", repos = "https://bioconductor.org/packages/3.15/bioc")
library(IRanges)
cvr <- d[, coverage(IRanges(as.numeric(start.date), as.numeric(end.date)))]
data.table(start.date = lubridate::as_date(start(cvr)),
           end.date = lubridate::as_date(end(cvr)),
           count = runValue(cvr))
   start.date   end.date count
1: 1970-01-02 2021-05-31     0
2: 2021-06-01 2021-07-01     1
3: 2021-07-02 2021-08-04     2
4: 2021-08-05 2021-09-12     1

The result represents the time scale: each row gives the number of overlapping date ranges (the coverage, shown in column count) for one subperiod.

Explanation

The input dataset

     id start.date   end.date
1: id_1 2021-06-01 2021-08-04
2: id_2 2021-07-02 2021-09-12

is converted to integer ranges in order to utilize the coverage() function from IRanges. coverage() returns a compact run-length encoded (RLE) representation of the subperiods:

cvr
integer-Rle of length 18882 with 4 runs
  Lengths: 18778    31    34    39
  Values :     0     1     2     1

Finally, the RLE is converted to a data.table with the integer ranges coerced back to Date class.
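
As a cross-check on what coverage() computes, the same subperiods can be sketched without Bioconductor by sweeping over start and end events. This is not how IRanges is implemented, only a minimal data.table sketch that assumes inclusive end dates; the names events, delta and sweep are made up for illustration.

```r
library(data.table)

d <- data.table(id = c("id_1", "id_2"),
                start.date = as.Date(c("2021-06-01", "2021-07-02")),
                end.date   = as.Date(c("2021-08-04", "2021-09-12")))

# +1 on each start date, -1 on the day after each (inclusive) end date
events <- rbind(d[, .(date = start.date, delta = 1L)],
                d[, .(date = end.date + 1L, delta = -1L)])

# net change per distinct date, sorted by date; dates with no net change
# are dropped so that neighbouring runs with equal counts are merged
events <- events[, .(delta = sum(delta)), keyby = date][delta != 0L]

# running total = number of ids active from this date until the next event
events[, count := cumsum(delta)]

# each run ends the day before the next event; the final event only closes
# the last run and has no end date of its own, so it is dropped
sweep <- events[, .(start.date = date,
                    end.date = shift(date, type = "lead") - 1L,
                    count)][!is.na(end.date)]
sweep
```

For the two sample rows this should reproduce the three non-zero subperiods shown above, without the leading zero-count run that starts at 1970-01-02. Memory use grows with the number of distinct event dates rather than with the number of id-days.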

Usage

The result lends itself to several use cases:

result <- data.table(start.date = lubridate::as_date(start(cvr)),
           end.date = lubridate::as_date(end(cvr)),
           count = runValue(cvr))[-1]

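Here, the date range has been trimmed: the first row, which covers the leading zero-count period from 1970-01-02 up to the day before the earliest start date, was removed.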

Plotting

library(ggplot2)
ggplot(result[]) +
  aes(x = start.date, y = count, xend = end.date, yend = count) +
  geom_segment()


Querying

result["2021-08-21" %between% .(start.date, end.date)]
   start.date   end.date count
1: 2021-08-05 2021-09-12     1
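
To look up several dates at once, a non-equi join against the subperiods is one option. The following is a minimal sketch, assuming the trimmed result table from above (re-created by hand here so the snippet runs on its own); the names queries, q and looked_up are hypothetical.

```r
library(data.table)

# the trimmed result from above, typed in by hand for a self-contained example
result <- data.table(
  start.date = as.Date(c("2021-06-01", "2021-07-02", "2021-08-05")),
  end.date   = as.Date(c("2021-07-01", "2021-08-04", "2021-09-12")),
  count      = c(1L, 2L, 1L))

# dates to look up (hypothetical)
queries <- data.table(q = as.Date(c("2021-06-15", "2021-07-15", "2021-08-21")))

# for each q, find the subperiod with start.date <= q <= end.date
looked_up <- result[queries,
                    on = .(start.date <= q, end.date >= q),
                    .(date = i.q, count = x.count)]
looked_up
```

Because the subperiods do not overlap, each query date matches at most one row; a date outside all subperiods would come back with count NA under the default nomatch setting.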

Expanding (inverse RLE)

result[, .(Date = seq(start.date, end.date, by = 1), count), by = 1:nrow(result)]
     nrow       Date count
  1:    1 2021-06-01     1
  2:    1 2021-06-02     1
  3:    1 2021-06-03     1
  4:    1 2021-06-04     1
  5:    1 2021-06-05     1
 ---                      
100:    3 2021-09-08     1
101:    3 2021-09-09     1
102:    3 2021-09-10     1
103:    3 2021-09-11     1
104:    3 2021-09-12     1

N.B.: With the development version 1.14.3 of data.table the code can be simplified by using by = .I for row-wise operations.

data.table::update.dev.pkg()
library(data.table)
result[, .(Date = seq(start.date, end.date, by = 1), count), by = .I]

Answer to original question

As there are many rows but only a limited number of possible (a, b) combinations, and therefore only a few distinct ways to fill in the 1s in the matrix, my suggestion is to join with a look-up table.

lut <- fread(
"
a, b, c, d, e, f
c, d, 1, 1,NA,NA
c, e, 1, 1, 1,NA
c, f, 1, 1, 1, 1
d, e,NA, 1, 1,NA
d, f,NA, 1, 1, 1
e, f,NA,NA, 1, 1
")

lut[d, on =.(a, b), .(id, a, b, c, d, e, f)]
   id a b  c d e  f
1: A1 c e  1 1 1 NA
2: B2 d f NA 1 1  1
3: C3 c e  1 1 1 NA
4: D4 d f NA 1 1  1
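
With more than a handful of columns, typing the look-up table by hand becomes tedious. It could instead be generated; the following is a minimal sketch, assuming the fill columns are ordered c, d, e, f and that a always comes strictly before b, as in the hand-written table. The helper names idx, pairs and fill are made up for illustration.

```r
library(data.table)

col_names <- c("c", "d", "e", "f")
idx <- seq_along(col_names)

# all (start, end) index pairs with start strictly before end, sorted
pairs <- CJ(ia = idx, ib = idx)[ia < ib]

# one row per pair: 1 for columns between start and end (inclusive), NA elsewhere
fill <- t(mapply(function(ia, ib) ifelse(idx >= ia & idx <= ib, 1L, NA_integer_),
                 pairs$ia, pairs$ib))
colnames(fill) <- col_names

# key columns a and b followed by the fill pattern
lut <- cbind(data.table(a = col_names[pairs$ia], b = col_names[pairs$ib]),
             as.data.table(fill))
lut
```

The generated table has one row per (a, b) pair, so it grows quadratically with the number of columns; for date columns spanning several years that may be too large, which is one reason the run-length approach is preferable for the edited question.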

This approach is orders of magnitude faster than r2evans' answer and consumes less memory. For a sample use case with 1 million rows, r2evans' approach took more than 30 seconds and allocated nearly 600 MB of memory, while the join took less than 150 ms and allocated less than 100 MB.

Benchmark details

library(bench)
library(data.table)
col_names <- letters[3:6]
n_cols <- length(col_names)
lut_text <- 
"a, b, c, d, e, f
c, d, 1, 1,NA,NA
c, e, 1, 1, 1,NA
c, f, 1, 1, 1, 1
d, e,NA, 1, 1,NA
d, f,NA, 1, 1, 1
e, f,NA,NA, 1, 1"
bm <- press(
  n = 10^(1:6),
  {
    set.seed(42)
    ia <- sample(1:(n_cols - 1), n, replace = TRUE)
    ib <- pmin(ia + sample(1:(n_cols - 1), n, replace = TRUE), n_cols) 
    d <- data.table(id = 1:n,
                    a = col_names[ia],
                    b = col_names[ib]
    )
    for (col in col_names) {
      set(d, , col, NA_integer_)
    }
    str(d)
    mark(
      r2evans = {
        seq.character <- function(from, to, ...) {
          letters[seq(match(tolower(from), letters), 
                      match(tolower(to), letters), ...)]
        }
        newd <- rbindlist(Map(function(...) {
          o <- seq.character(...)
          setNames(as.list(rep(1L, length(o))), o)
        }, d$a, d$b), fill = TRUE, use.names = TRUE)
        cbind(d[,1:3], newd)
      },
      join = {
        lut <- fread(text = lut_text)
        lut[d, on =.(a, b), .(id, a, b, c, d, e, f)]
      }
    )
  }
)
bm
# A tibble: 12 × 14
   expression       n      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory    
   <bch:expr>   <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>       <list>    
 1 r2evans         10  868.5µs  937.7µs 1002.        1.64MB    5.32    377     2    376.1ms <data.table> <Rprofmem>
 2 join            10   2.43ms   2.99ms  322.      928.97KB    4.13    156     2   483.99ms <data.table> <Rprofmem>
 3 r2evans        100   3.03ms   3.24ms  289.      109.08KB    8.45    137     4   473.24ms <data.table> <Rprofmem>
 4 join           100   2.44ms   2.66ms  355.      140.84KB    2.03    175     1   493.48ms <data.table> <Rprofmem>
 5 r2evans       1000  26.09ms  27.11ms   35.7     803.18KB   11.0      13     4   364.26ms <data.table> <Rprofmem>
 6 join          1000   2.48ms   2.67ms  359.      225.21KB    4.12    174     2   485.02ms <data.table> <Rprofmem>
 7 r2evans      10000 288.68ms 299.55ms    3.34      5.95MB    8.35      2     5    599.1ms <data.table> <Rprofmem>
 8 join         10000   3.59ms    4.3ms  217.        1.04MB    3.98    109     2   502.33ms <data.table> <Rprofmem>
 9 r2evans     100000    3.26s    3.26s    0.307    58.48MB    5.52      1    18      3.26s <data.table> <Rprofmem>
10 join        100000  12.14ms  13.07ms   64.7       9.28MB    7.84     33     4   509.99ms <data.table> <Rprofmem>
11 r2evans    1000000   30.76s   30.76s    0.0325   583.7MB    0.845     1    26     30.76s <data.table> <Rprofmem>
12 join       1000000  74.74ms 141.19ms    1.65     91.68MB    0.826     4     2      2.42s <data.table> <Rprofmem>
# … with 2 more variables: time <list>, gc <list>
ggplot2::autoplot(bm)


Note that bench::mark() by default checks whether the results of all expressions are equal, so the benchmark also confirms that both approaches return the same table.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
