'Flagging continuous observations and creating enrolment spans

I have a few large enrolment datasets and I'm trying to create two things:

  1. I'd like to flag each uninterrupted monthly observation (final_df1)
  2. I'd like to create a dataset of uninterrupted spans (final_df2)

For example:

library(tidyverse)
library(lubridate)
library(magrittr)
df<-tibble(id=c(rep("X",10),rep("Y",20)),
           date=c(ymd("20120101")%m+%months(c(1:5,7:11)),ymd("20120401")%m+%months(c(1:10,12:17,19:22))))

final_df1 <- df %>% mutate(cont_enroll=c(rep(1,5),rep(0,5),rep(1,10),rep(0,10)))

final_df2 <- tibble(id=c(rep("X",2),rep("Y",3)),
                  span_start=c(ymd("20120101")%m+%months(1),
                               ymd("20120101")%m+%months(7),
                               ymd("20120401")%m+%months(1),
                               ymd("20120101")%m+%months(12),
                               ymd("20120101")%m+%months(19)),
                  span_end=c(ymd("20120101")%m+%months(5),
                             ymd("20120101")%m+%months(11),
                             ymd("20120101")%m+%months(10),
                             ymd("20120101")%m+%months(17),
                             ymd("20120101")%m+%months(22))
                  )

I feel like there must be a simple way to do this between {lubridate} and {data.table} but I'm drawing up blanks. Any tips?



Solution 1:[1]

Grouped by 'id', create an interval with the previous value of 'date' (lag) and the current 'date', divide by the months, check if it is less than 2, and take the cumulative minimum (cummin). After creating the 'find_df_new', then we group by 'id' and the run-length-id of 'cont_enroll' column, and summarise with the first and last value of 'date' to create the 'span_start' and 'span_end' respectively

library(dplyr)
library(lubridate)
library(data.table)
final_df_new <- df %>%
   group_by(id)  %>%
   mutate(cont_enroll2 = cummin(interval(lag(date, default = first(date)), 
       date) /months(1) < 2))  %>%
   ungroup
final_df_new %>% 
  group_by(id, grp = rleid(cont_enroll2))  %>%
  summarise(span_start = first(date), span_end = last(date), .groups = 'drop')

Solution 2:[2]

I think you can solve this nicely with the ivs package. Your dates seem to really represent 1 month intervals, and the ivs package is dedicated to working with data of this type.

We can compute final_df2 with iv_groups(), which returns the non-overlapping intervals that remain after merging all overlapping intervals.

Then the first row of final_df2 per group represents the first continuous interval, so you just need to check if each range is within that interval or not to decide if it is part of the uninterrupted set to get final_df1.

Note that my final_df2 looks different from yours, is it possible that you have an error in how you coded it?

library(dplyr)
library(lubridate)
library(ivs)

df <- tibble(
  id = c(
    rep("X", 10), 
    rep("Y", 20)
  ),
  date = c(
    ymd("20120101") %m+% months(c(1:5,7:11)), 
    ymd("20120401") %m+% months(c(1:10,12:17,19:22))
  )
)
df
#> # A tibble: 30 × 2
#>    id    date      
#>    <chr> <date>    
#>  1 X     2012-02-01
#>  2 X     2012-03-01
#>  3 X     2012-04-01
#>  4 X     2012-05-01
#>  5 X     2012-06-01
#>  6 X     2012-08-01
#>  7 X     2012-09-01
#>  8 X     2012-10-01
#>  9 X     2012-11-01
#> 10 X     2012-12-01
#> # … with 20 more rows

df <- df %>%
  mutate(start = date, end = date + months(1), .keep = "unused") %>%
  mutate(range = iv(start, end), .keep = "unused")

df
#> # A tibble: 30 × 2
#>    id                       range
#>    <chr>               <iv<date>>
#>  1 X     [2012-02-01, 2012-03-01)
#>  2 X     [2012-03-01, 2012-04-01)
#>  3 X     [2012-04-01, 2012-05-01)
#>  4 X     [2012-05-01, 2012-06-01)
#>  5 X     [2012-06-01, 2012-07-01)
#>  6 X     [2012-08-01, 2012-09-01)
#>  7 X     [2012-09-01, 2012-10-01)
#>  8 X     [2012-10-01, 2012-11-01)
#>  9 X     [2012-11-01, 2012-12-01)
#> 10 X     [2012-12-01, 2013-01-01)
#> # … with 20 more rows

# `iv_groups()` returns the groups that remain after merging all overlapping ranges.
# It gives you `final_df2`.
continuous <- df %>%
  group_by(id) %>%
  summarise(range = iv_groups(range), .groups = "drop")

continuous
#> # A tibble: 5 × 2
#>   id                       range
#>   <chr>               <iv<date>>
#> 1 X     [2012-02-01, 2012-07-01)
#> 2 X     [2012-08-01, 2013-01-01)
#> 3 Y     [2012-05-01, 2013-03-01)
#> 4 Y     [2013-04-01, 2013-10-01)
#> 5 Y     [2013-11-01, 2014-03-01)

# The first continuous range per id
first_continuous <- continuous %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup() %>%
  rename(range_continuous = range)

first_continuous
#> # A tibble: 2 × 2
#>   id            range_continuous
#>   <chr>               <iv<date>>
#> 1 X     [2012-02-01, 2012-07-01)
#> 2 Y     [2012-05-01, 2013-03-01)

# Join the first continuous range df back onto the original df and see if
# the current `range` falls within the first continuous range or not.
# This gives you `final_df1`.
left_join(df, first_continuous, by = "id") %>%
  mutate(continuous = iv_pairwise_overlaps(range, range_continuous, type = "within"))
#> # A tibble: 30 × 4
#>    id                       range         range_continuous continuous
#>    <chr>               <iv<date>>               <iv<date>> <lgl>     
#>  1 X     [2012-02-01, 2012-03-01) [2012-02-01, 2012-07-01) TRUE      
#>  2 X     [2012-03-01, 2012-04-01) [2012-02-01, 2012-07-01) TRUE      
#>  3 X     [2012-04-01, 2012-05-01) [2012-02-01, 2012-07-01) TRUE      
#>  4 X     [2012-05-01, 2012-06-01) [2012-02-01, 2012-07-01) TRUE      
#>  5 X     [2012-06-01, 2012-07-01) [2012-02-01, 2012-07-01) TRUE      
#>  6 X     [2012-08-01, 2012-09-01) [2012-02-01, 2012-07-01) FALSE     
#>  7 X     [2012-09-01, 2012-10-01) [2012-02-01, 2012-07-01) FALSE     
#>  8 X     [2012-10-01, 2012-11-01) [2012-02-01, 2012-07-01) FALSE     
#>  9 X     [2012-11-01, 2012-12-01) [2012-02-01, 2012-07-01) FALSE     
#> 10 X     [2012-12-01, 2013-01-01) [2012-02-01, 2012-07-01) FALSE     
#> # … with 20 more rows

Created on 2022-05-13 by the reprex package (v2.0.1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Davis Vaughan