Use 'start' and 'end' values in two columns to specify a fill range over the remaining columns in R
I need to fill each row of a matrix with 1s between its 'start' and 'end' columns, where the start and end column names (dates in the real data) are specified for each 'id' in two columns of the matrix.
e.g.
library(data.table)
d <- data.table(id = c("id_1", "id_2"),
                start.date = c(as.Date("2021-06-01"), as.Date("2021-07-02")),
                end.date   = c(as.Date("2021-08-04"), as.Date("2021-09-12")))
> d
     id start.date   end.date
1: id_1 2021-06-01 2021-08-04
2: id_2 2021-07-02 2021-09-12
The goal is to get a count of the number of individuals that fall on each date. With a smaller dataset I would do this:
expand.dates <- function(start.date, end.date) {
  seq.Date(start.date, end.date, "1 day")
}
## join the expanded date sequences back to the original data.table 'd' on 'id'
xx <- d[d[, .(dates = expand.dates(start.date, end.date)), by = id], on = .(id)]
## count the individuals per date
cnts <- xx[, .(counts = .N), by = .(dates)]
But the real data has several million individual IDs, and the above approach leads to a memory error (cannot allocate vector of size 8.5 GB), so I am trying to 'cast' the date ranges and then run colSums across the dates to get the counts.
Solution 1
Answer to edited question
The OP has edited the question and disclosed more of the intention:
imagine several million distinct IDs and a full range of possible start and end dates, spanning anywhere from a few days to a few years. The goal is to get a count of individuals that fall on each date
I have solved a similar problem with the help of the IRanges package from Bioconductor:
install.packages("IRanges", repos = "https://bioconductor.org/packages/3.15/bioc")
library(IRanges)
cvr <- d[, coverage(IRanges(as.numeric(start.date), as.numeric(end.date)))]
data.table(start.date = lubridate::as_date(start(cvr)),
           end.date   = lubridate::as_date(end(cvr)),
           count      = runValue(cvr))
   start.date   end.date count
1: 1970-01-02 2021-05-31     0
2: 2021-06-01 2021-07-01     1
3: 2021-07-02 2021-08-04     2
4: 2021-08-05 2021-09-12     1
The result represents the time scale, where each row shows the number of overlaps (the coverage) for each subperiod.
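As a quick sanity check on the toy data, the count for any single date can be computed directly from d (a sketch; 2021-07-15 is chosen because it lies inside both intervals):

## number of intervals covering 2021-07-15; matches the run with count 2 above
d[start.date <= as.Date("2021-07-15") & end.date >= as.Date("2021-07-15"), .N]
# [1] 2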
Explanation
The input dataset

     id start.date   end.date
1: id_1 2021-06-01 2021-08-04
2: id_2 2021-07-02 2021-09-12
is converted to integer ranges in order to utilize the coverage() function from IRanges. coverage() returns a compact run-length encoded (RLE) representation of the subperiods:
cvr
integer-Rle of length 18882 with 4 runs
  Lengths: 18778    31    34    39
  Values :     0     1     2     1
Finally, the RLE is converted to a data.table with the integer ranges coerced back to Date class.
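The coercion works because Date objects are stored as days since 1970-01-01, so the integer run boundaries map straight back to dates:

as.numeric(as.Date("2021-06-01"))  # 18779 days since 1970-01-01
lubridate::as_date(18779)          # "2021-06-01", the inverse mapping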
Usage
The result can be easily used in a variety of use cases:
result <- data.table(start.date = lubridate::as_date(start(cvr)),
                     end.date   = lubridate::as_date(end(cvr)),
                     count      = runValue(cvr))[-1]
Here, the date range has been trimmed: the first row (the zero-count run from 1970-01-02 to 2021-05-31, an artifact of the integer conversion) was removed.
Plotting
library(ggplot2)
ggplot(result[]) +
  aes(x = start.date, y = count, xend = end.date, yend = count) +
  geom_segment()
Querying
result["2021-08-21" %between% .(start.date, end.date)]
start.date end.date count 1: 2021-08-05 2021-09-12 1
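Adding a j expression returns the count alone:

result["2021-08-21" %between% .(start.date, end.date), count]
# [1] 1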
Expanding (inverse RLE)
result[, .(Date = seq(start.date, end.date, by = 1), count), by = 1:nrow(result)]

     nrow       Date count
  1:    1 2021-06-01     1
  2:    1 2021-06-02     1
  3:    1 2021-06-03     1
  4:    1 2021-06-04     1
  5:    1 2021-06-05     1
 ---
100:    3 2021-09-08     1
101:    3 2021-09-09     1
102:    3 2021-09-10     1
103:    3 2021-09-11     1
104:    3 2021-09-12     1
N.B.: With the development version 1.14.3 of data.table, the code can be simplified by using by = .I for row-wise operations:

data.table::update.dev.pkg()
library(data.table)
result[, .(Date = seq(start.date, end.date, by = 1), count), by = .I]
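For completeness: where Bioconductor is unavailable, the same run boundaries can be computed in plain data.table with the classic +1/-1 event trick. A minimal sketch (not part of the original answer), assuming the input d from the question:

## +1 the day an interval opens, -1 the day after it closes
events <- rbind(d[, .(date = start.date, delta = 1L)],
                d[, .(date = end.date + 1L, delta = -1L)])
## net change per date, then a cumulative sum over the sorted change points;
## each count holds from its date up to the day before the next change point
steps <- events[, .(delta = sum(delta)), keyby = date][, count := cumsum(delta)]

This materializes only two rows per id and is essentially what coverage() computes, minus the compact Rle container.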
Answer to original question
As there are many rows but only a limited number of ways to fill in the 1s in the matrix, my suggestion is to join with a look-up table. (Here, d refers to the original question's example data, with columns id, a, and b.)
lut <- fread(
"
a, b, c, d, e, f
c, d, 1, 1,NA,NA
c, e, 1, 1, 1,NA
c, f, 1, 1, 1, 1
d, e,NA, 1, 1,NA
d, f,NA, 1, 1, 1
e, f,NA,NA, 1, 1
")
lut[d, on = .(a, b), .(id, a, b, c, d, e, f)]

   id a b  c d e  f
1: A1 c e  1 1 1 NA
2: B2 d f NA 1 1  1
3: C3 c e  1 1 1 NA
4: D4 d f NA 1 1  1
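The look-up table is typed by hand here, which will not scale to the many date columns of the real data. A sketch of generating it programmatically (assuming the fill columns are c through f, as above):

col_names <- letters[3:6]
## all ordered column pairs (a < b), matching the hand-written rows
lut <- CJ(a = col_names, b = col_names)[a < b]
## put a 1 in every column between the 'a' and 'b' positions, NA elsewhere
for (col in col_names) {
  set(lut, , col, fifelse(lut$a <= col & col <= lut$b, 1L, NA_integer_))
}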
This approach is orders of magnitude faster than r2evans' answer and consumes less memory. For a sample use case with 1 million rows, r2evans' approach took more than 30 seconds and allocated nearly 600 MB of memory, while the join took less than 150 ms and allocated less than 100 MB.
Benchmark details
library(bench)
library(data.table)   # needed for data.table(), set(), fread(), rbindlist()

col_names <- letters[3:6]
n_cols <- length(col_names)
lut_text <-
"a, b, c, d, e, f
c, d, 1, 1,NA,NA
c, e, 1, 1, 1,NA
c, f, 1, 1, 1, 1
d, e,NA, 1, 1,NA
d, f,NA, 1, 1, 1
e, f,NA,NA, 1, 1"

bm <- press(
  n = 10^(1:6),
  {
    set.seed(42)
    ## random start column index, and an end column index at or after it
    ia <- sample(1:(n_cols - 1), n, replace = TRUE)
    ib <- pmin(ia + sample(1:(n_cols - 1), n, replace = TRUE), n_cols)
    d <- data.table(id = 1:n,
                    a = col_names[ia],
                    b = col_names[ib])
    ## pre-create the empty fill columns
    for (col in col_names) {
      set(d, , col, NA_integer_)
    }
    str(d)
    mark(
      r2evans = {
        seq.character <- function(from, to, ...) {
          letters[seq(match(tolower(from), letters),
                      match(tolower(to), letters), ...)]
        }
        newd <- rbindlist(Map(function(...) {
          o <- seq.character(...)
          setNames(as.list(rep(1L, length(o))), o)
        }, d$a, d$b), fill = TRUE, use.names = TRUE)
        cbind(d[, 1:3], newd)
      },
      join = {
        lut <- fread(text = lut_text)
        lut[d, on = .(a, b), .(id, a, b, c, d, e, f)]
      }
    )
  }
)
bm
# A tibble: 12 × 14
   expression       n      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory
   <bch:expr>   <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>       <list>
 1 r2evans         10  868.5µs  937.7µs 1002.        1.64MB    5.32    377     2    376.1ms <data.table> <Rprofmem>
 2 join            10   2.43ms   2.99ms  322.      928.97KB    4.13    156     2   483.99ms <data.table> <Rprofmem>
 3 r2evans        100   3.03ms   3.24ms  289.      109.08KB    8.45    137     4   473.24ms <data.table> <Rprofmem>
 4 join           100   2.44ms   2.66ms  355.      140.84KB    2.03    175     1   493.48ms <data.table> <Rprofmem>
 5 r2evans       1000  26.09ms  27.11ms   35.7     803.18KB   11.0      13     4   364.26ms <data.table> <Rprofmem>
 6 join          1000   2.48ms   2.67ms  359.      225.21KB    4.12    174     2   485.02ms <data.table> <Rprofmem>
 7 r2evans      10000 288.68ms 299.55ms    3.34      5.95MB    8.35      2     5    599.1ms <data.table> <Rprofmem>
 8 join         10000   3.59ms    4.3ms  217.        1.04MB    3.98    109     2   502.33ms <data.table> <Rprofmem>
 9 r2evans     100000    3.26s    3.26s    0.307    58.48MB    5.52      1    18      3.26s <data.table> <Rprofmem>
10 join        100000  12.14ms  13.07ms   64.7       9.28MB    7.84     33     4   509.99ms <data.table> <Rprofmem>
11 r2evans    1000000   30.76s   30.76s    0.0325   583.7MB    0.845     1    26     30.76s <data.table> <Rprofmem>
12 join       1000000  74.74ms 141.19ms    1.65     91.68MB    0.826     4     2      2.42s <data.table> <Rprofmem>
# … with 2 more variables: time <list>, gc <list>
ggplot2::autoplot(bm)
Note that bench::mark() by default checks that the results are equal.
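If the expressions produced results that are equivalent but not identical (e.g., differing in row order), that check could be disabled; a contrived sketch:

library(bench)
mark(ascending  = sort(c(3, 1, 2)),
     descending = sort(c(3, 1, 2), decreasing = TRUE),
     check = FALSE)  # the results differ, so skip the equality check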
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow