'How can I find the nearest date after another date in a different column in grouping by ID using R?

I am looking for a method that will look at each date in "Date A" and find the next nearest date after that value in "Date B" by ID (group_by). I then want to calculate the difference in days. Below is the table that I would like.

ID    |  Date A    |  Date B     | Difference| 
11111 |  09/01/21  |  09/03/21   | 2         |
22222 |  09/06/21  |  09/20/21   | 11        | 
11111 |  09/08/21  |  09/18/21   | 10        |  
44444 |  09/04/21  |  NA         | 11        | 
44444 |  09/10/21  |  09/15/21   | 5         |
22222 |  NA        |  09/17/21   | NA        | 
77777 |  NA        |  10/16/21   | NA        |
77777 |  09/04/21  |  10/17/21   | 24        |
77777 |  09/01/21  |  09/28/21   | 27        |

If you could please help me out with this, I would greatly appreciate it!

Cheers



Solution 1:[1]

A dplyr solution via group_by solution is not obvious to me here, but here is a relatively straightforward sqldf solution. Presumably this could be translated into a dplyr solution if you really wanted.

First mock up the data within R

df <- dplyr::tribble(
  ~'ID', ~'Date A', ~'Date B',
  11111,  '09/01/21', '09/03/21',
  22222,  '09/06/21', '09/20/21',
  11111,  '09/08/21', '09/18/21',
  44444,  '09/04/21', NA        ,
  44444,  '09/10/21', '09/15/21',
  22222,  NA        , '09/17/21',
  77777,  NA        , '10/16/21',
  77777,  '09/04/21', '10/17/21',
  77777,  '09/01/21', '09/28/21'
)
df$`Date A` <- lubridate::mdy(df$`Date A`)
df$`Date B` <- lubridate::mdy(df$`Date B`)
df

Which looks like

# A tibble: 9 x 3
     ID `Date A`   `Date B`  
  <dbl> <date>     <date>    
1 11111 2021-09-01 2021-09-03
2 22222 2021-09-06 2021-09-20
3 11111 2021-09-08 2021-09-18
4 44444 2021-09-04 NA        
5 44444 2021-09-10 2021-09-15
6 22222 NA         2021-09-17
7 77777 NA         2021-10-16
8 77777 2021-09-04 2021-10-17
9 77777 2021-09-01 2021-09-28

Then do an inequality join combined with a group by. The column I is added to allow for nuances of the data such as multiple of the same Date A within each ID

df$I <- 1:nrow(df)

df <- sqldf::sqldf('
SELECT a.I, a.ID, a."Date A", a."Date B",
  MIN(b."Date B") AS NextB
FROM df a
LEFT JOIN df b
  ON a.ID = b.ID
  AND a."Date A" < b."Date B"
GROUP BY a.I, a.ID, a."Date A", a."Date B"
ORDER BY a.I
')

df$Difference = df$NextB - as.integer(df$`Date A`)
df$I <- NULL
df$NextB <- NULL
df

Which matches your example data (and should generalize well for edge cases not in your example data). Unclear how well it might scale up to non-trivial data.

     ID     Date A     Date B Difference
1 11111 2021-09-01 2021-09-03          2
2 22222 2021-09-06 2021-09-20         11
3 11111 2021-09-08 2021-09-18         10
4 44444 2021-09-04       <NA>         11
5 44444 2021-09-10 2021-09-15          5
6 22222       <NA> 2021-09-17         NA
7 77777       <NA> 2021-10-16         NA
8 77777 2021-09-04 2021-10-17         24
9 77777 2021-09-01 2021-09-28         27

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jmuhlenkamp