'R How can I substract date time rows from each other dependent on Participant ID?

Starting with disclaimers here: I am completely new to R and have already read other questions & answers here concerning similar topics but can't seem to find one that helps my problem, so please forgive me if I just wasn't able identify a correct and applicable answer that may already exists, due to my inexperience. Another warning before you continure reading: I have tried my best to create a minimal reproducible example and have failed thus far, so now I am hoping my question might also be answered without it - but I obviously understand if this basis is too cumbersome for anyone to continue with. (However, I have deposited a minimal version of my csv data table with just two participants and their time date data, in case anyone would like to get an impression of the data: Link to my dropbox containing a small version of the csv data table).

My main problem here is that I would like to substract date time data in one column from each other, in order to fill another row with data called t0-t28, but dependent on/ nested for Participant ID.

Background: I've got a data set ("workdata") with about 1700 columns and 380 rows, results from a repeated measures psychology study. For the study, participants filled out five different questionnaires: Onboarding (at t0), Daily_A (at t1-t6 and t8-t13), Day_7 (at t7), Day14 (at t14) and Day28 (at t28, follow-up). Each participant has a unique identifier, their Participant ID. Each Participant ID may appear up to 16 times in the data set, if the regarding participant has filled out all questionnaires mentioned above. (Which is not always the case, since the adherence of some participants was as low as 20%).

So far: A friend and I have been able to tell R to identify the date time values as such (referring to the SCHEDULED_TS column), using the lubridate package:

date_sorted <- dmy_hm(workdata$SCHEDULED_TS)
workdata <- cbind(workdata,date_sorted)

Also, we have managed to sort all rows by participant ID and date time, so now all repeated measures per participant are displayed as the rows underneath each other in date time order.

Data sorted by date time and participant ID, but without column displaying t0-t28 per participant

Then, we wanted to create a new column that displays t0-t28 for each participants data row. This is needed because not all participants have started their study participation on the same day – so March 14 might be t0 for one participant but t9 for another. I would like to be able to compare t1 for all participants regardless of whether t1 was March 5th or March 15th for any participant.

We thought the easiest way to creating and filling this t0-t28 column would be by identifying t0 (“Onboarding” time) per participant and then substracting its date time from the date times of all other measures in order to receive t1, t2, t3 and so on. We have tried this by

t <- length(workdata$date_sorted)
t <- ifelse(workdata$SURVEY_NAME == "Onboarding", "t0", NA)
workdata <- cbind(workdata, t)

and several strategies that I accidentaly deleted, so I can't display them here anymore - I am really sorry my first post here is so messy.

All we did receive though is that R took the first “Onboarding” date time in the entire data set and subtracted it from all the other date times, regardless of the corresponding Participant ID.

Does anyone here have a tip on how to solve this and create t0-t28 nested per participant? Thank you very much for your kind help.

r


Solution 1:[1]

Here's how you can do it using the dplyr package. The steps are:

  1. Calculate date_sorted (keeping only the date part)
  2. Group the rows by PARTICIPANT_ID
  3. Then within each group, subtract the minimum date from the current date and prefix it with 't'.
  4. Ungroup the rows again

Note that Day28 is on t27 if starting from 0.

library(dplyr)
library(lubridate)

df <- workdata %>%
  mutate(date_sorted = date(dmy_hm(SCHEDULED_TS))) %>%
  group_by(PARTICIPANT_ID) %>%
  mutate(t = paste0('t', date_sorted - min(date_sorted))) %>%
  ungroup()
  
df
#> # A tibble: 32 x 9
#>    Filter.1 source                     PARTICIPANT_ID Groups SURVEY_NAME START_END SCHEDULED_TS    date_sorted t    
#>       <dbl> <chr>                      <chr>          <chr>  <chr>           <dbl> <chr>           <date>      <chr>
#>  1        1 Onboarding Export 26042022 s014620782     Exp    Onboarding          1 17-Mar-22 17:29 2022-03-17  t0   
#>  2        1 Daily_A Export 26042022    s014620782     Exp    Daily_A             1 18-Mar-22 18:48 2022-03-18  t1   
#>  3        1 Daily_A Export 26042022    s014620782     Exp    Daily_A             1 19-Mar-22 19:29 2022-03-19  t2   
#>  4        1 Daily_A Export 26042022    s014620782     Exp    Daily_A             1 20-Mar-22 19:17 2022-03-20  t3   
#>  5        1 Daily_A Export 26042022    s014620782     Exp    Daily_A             1 21-Mar-22 18:43 2022-03-21  t4   
#>  6        1 Daily_A Export 26042022    s014620782     Exp    Daily_A             1 22-Mar-22 18:39 2022-03-22  t5   
#>  7        1 Daily_A Export 26042022    s014620782     Exp    Daily_A             1 23-Mar-22 19:16 2022-03-23  t6   
#>  8        1 Day_7 Export 26042022      s014620782     Exp    Day_7               1 24-Mar-22 18:44 2022-03-24  t7   
#>  9        1 Daily_A Export 26042022    s014620782     Exp    Daily_A             1 25-Mar-22 18:46 2022-03-25  t8   
#> 10        1 Daily_A Export 26042022    s014620782     Exp    Daily_A             1 26-Mar-22 18:44 2022-03-26  t9   
#> # … with 22 more rows

Solution 2:[2]

Thank you again @Aron for your great answer. My supervisor now answered my question with a slightly different code, which I would like to share here in case anyone finds it useful in the future (all credit goes to my supervisor though!). She also found out that the reason why the t values were not calculated with the previous code was because in some cases, there were NA values in the SCHEDULED_TS column. Here is her final code:

library(tidyverse)
library(lubridate)
library(dplyr)

date_sorted <- dmy_hm(workdata$SCHEDULED_TS)
workdata <- cbind(workdata,date_sorted)
idx1 <- order(workdata$PARTICIPANT_ID, workdata$date_sorted)
workdata <- workdata[idx1, ]


workdata$SCHEDULED_TS <- dmy_hm(workdata$SCHEDULED_TS)


#Add measurement day relative to onboarding

df <- workdata %>%
 filter(!is.na(SCHEDULED_TS)) %>%  # remove NA values
group_by(PARTICIPANT_ID) %>%      # group by participant
mutate(t = paste0('t', round(as.numeric(interval(min(SCHEDULED_TS), SCHEDULED_TS), 'days')))) %>% # calculate interval in days
ungroup()

I hope this may help anyone with a similar problem in the future. Thank's to everyone who answered my question!

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Aron
Solution 2 Antje Stueven