'Flagging continuous observations and creating enrolment spans
I have a few large enrolment datasets and I'm trying to create two things:
- I'd like to flag each uninterrupted monthly observation (
final_df1
) - I'd like to create a dataset of uninterrupted spans (
final_df2
)
For example:
library(tidyverse)
library(lubridate)
library(magrittr)
df<-tibble(id=c(rep("X",10),rep("Y",20)),
date=c(ymd("20120101")%m+%months(c(1:5,7:11)),ymd("20120401")%m+%months(c(1:10,12:17,19:22))))
final_df1 <- df %>% mutate(cont_enroll=c(rep(1,5),rep(0,5),rep(1,10),rep(0,10)))
final_df2 <- tibble(id=c(rep("X",2),rep("Y",3)),
span_start=c(ymd("20120101")%m+%months(1),
ymd("20120101")%m+%months(7),
ymd("20120401")%m+%months(1),
ymd("20120101")%m+%months(12),
ymd("20120101")%m+%months(19)),
span_end=c(ymd("20120101")%m+%months(5),
ymd("20120101")%m+%months(11),
ymd("20120101")%m+%months(10),
ymd("20120101")%m+%months(17),
ymd("20120101")%m+%months(22))
)
I feel like there must be a simple way to do this between {lubridate} and {data.table} but I'm drawing up blanks. Any tips?
Solution 1:[1]
Grouped by 'id', create an interval
with the previous value of 'date' (lag
) and the current 'date', divide by the months
, check if it is less than 2, and take the cumulative minimum (cummin
). After creating the 'find_df_new', then we group by 'id' and the run-length-id of 'cont_enroll' column, and summarise
with the first
and last
value of 'date' to create the 'span_start' and 'span_end' respectively
library(dplyr)
library(lubridate)
library(data.table)
final_df_new <- df %>%
group_by(id) %>%
mutate(cont_enroll2 = cummin(interval(lag(date, default = first(date)),
date) /months(1) < 2)) %>%
ungroup
final_df_new %>%
group_by(id, grp = rleid(cont_enroll2)) %>%
summarise(span_start = first(date), span_end = last(date), .groups = 'drop')
Solution 2:[2]
I think you can solve this nicely with the ivs package. Your dates seem to really represent 1 month intervals, and the ivs package is dedicated to working with data of this type.
We can compute final_df2
with iv_groups()
, which returns the non-overlapping intervals that remain after merging all overlapping intervals.
Then the first row of final_df2
per group represents the first continuous interval, so you just need to check if each range is within that interval or not to decide if it is part of the uninterrupted set to get final_df1
.
Note that my final_df2
looks different from yours, is it possible that you have an error in how you coded it?
library(dplyr)
library(lubridate)
library(ivs)
df <- tibble(
id = c(
rep("X", 10),
rep("Y", 20)
),
date = c(
ymd("20120101") %m+% months(c(1:5,7:11)),
ymd("20120401") %m+% months(c(1:10,12:17,19:22))
)
)
df
#> # A tibble: 30 × 2
#> id date
#> <chr> <date>
#> 1 X 2012-02-01
#> 2 X 2012-03-01
#> 3 X 2012-04-01
#> 4 X 2012-05-01
#> 5 X 2012-06-01
#> 6 X 2012-08-01
#> 7 X 2012-09-01
#> 8 X 2012-10-01
#> 9 X 2012-11-01
#> 10 X 2012-12-01
#> # … with 20 more rows
df <- df %>%
mutate(start = date, end = date + months(1), .keep = "unused") %>%
mutate(range = iv(start, end), .keep = "unused")
df
#> # A tibble: 30 × 2
#> id range
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-03-01)
#> 2 X [2012-03-01, 2012-04-01)
#> 3 X [2012-04-01, 2012-05-01)
#> 4 X [2012-05-01, 2012-06-01)
#> 5 X [2012-06-01, 2012-07-01)
#> 6 X [2012-08-01, 2012-09-01)
#> 7 X [2012-09-01, 2012-10-01)
#> 8 X [2012-10-01, 2012-11-01)
#> 9 X [2012-11-01, 2012-12-01)
#> 10 X [2012-12-01, 2013-01-01)
#> # … with 20 more rows
# `iv_groups()` returns the groups that remain after merging all overlapping ranges.
# It gives you `final_df2`.
continuous <- df %>%
group_by(id) %>%
summarise(range = iv_groups(range), .groups = "drop")
continuous
#> # A tibble: 5 × 2
#> id range
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-07-01)
#> 2 X [2012-08-01, 2013-01-01)
#> 3 Y [2012-05-01, 2013-03-01)
#> 4 Y [2013-04-01, 2013-10-01)
#> 5 Y [2013-11-01, 2014-03-01)
# The first continuous range per id
first_continuous <- continuous %>%
group_by(id) %>%
slice(1) %>%
ungroup() %>%
rename(range_continuous = range)
first_continuous
#> # A tibble: 2 × 2
#> id range_continuous
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-07-01)
#> 2 Y [2012-05-01, 2013-03-01)
# Join the first continuous range df back onto the original df and see if
# the current `range` falls within the first continuous range or not.
# This gives you `final_df1`.
left_join(df, first_continuous, by = "id") %>%
mutate(continuous = iv_pairwise_overlaps(range, range_continuous, type = "within"))
#> # A tibble: 30 × 4
#> id range range_continuous continuous
#> <chr> <iv<date>> <iv<date>> <lgl>
#> 1 X [2012-02-01, 2012-03-01) [2012-02-01, 2012-07-01) TRUE
#> 2 X [2012-03-01, 2012-04-01) [2012-02-01, 2012-07-01) TRUE
#> 3 X [2012-04-01, 2012-05-01) [2012-02-01, 2012-07-01) TRUE
#> 4 X [2012-05-01, 2012-06-01) [2012-02-01, 2012-07-01) TRUE
#> 5 X [2012-06-01, 2012-07-01) [2012-02-01, 2012-07-01) TRUE
#> 6 X [2012-08-01, 2012-09-01) [2012-02-01, 2012-07-01) FALSE
#> 7 X [2012-09-01, 2012-10-01) [2012-02-01, 2012-07-01) FALSE
#> 8 X [2012-10-01, 2012-11-01) [2012-02-01, 2012-07-01) FALSE
#> 9 X [2012-11-01, 2012-12-01) [2012-02-01, 2012-07-01) FALSE
#> 10 X [2012-12-01, 2013-01-01) [2012-02-01, 2012-07-01) FALSE
#> # … with 20 more rows
Created on 2022-05-13 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Davis Vaughan |