Earliest Date for each id in R

I have a dataset where each individual (id) has an e_date. Since an individual can have more than one e_date, I'm trying to get the earliest date for each individual, i.e. a dataset with one row per id showing that id's earliest e_date value. I used the aggregate function to find the minimum values, created a new variable combining the id and the date, and finally subset the original dataset against the one containing the minimums using that new variable. This is what I have so far:

new <- aggregate(e_date ~ id, data_full, min)

data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)

new["comb"] <- NULL
new$comb <- paste(new$id,new$e_date)

data_fixed <- data_full[which(new$comb %in% data_full$comb),]

First, the aggregate step doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids still appear more than once with different e_date values. Also, the code gives me different results when I convert the date with as.Date instead of keeping its original integer format. I think the answer is simple, but I'm stuck on this one.



Solution 1:[1]

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data_full)), order the rows by 'e_date', and then, grouped by 'id', take the first row of each group (head(.SD, 1L)).

library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
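A variation on the same idea, as a minimal sketch: instead of ordering first, pick the row with the smallest e_date inside each group using which.min. The sample data below is made up for illustration, and e_date is assumed to be of Date class.

```r
library(data.table)

# Hypothetical sample data; ids and dates are invented for illustration.
dt <- data.table(
  id     = c(1, 1, 2, 2, 3),
  e_date = as.Date(c("2016-05-01", "2015-12-11", "2016-08-25",
                     "2015-07-08", "2014-03-01"))
)

# For each id, keep the single row whose e_date is the smallest.
# which.min() returns only the first minimum, so ties give one row.
earliest <- dt[, .SD[which.min(e_date)], by = id]
print(earliest)
```

Unlike the ordered version, this leaves the original row order of the table untouched.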

Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.

library(dplyr)
data_full %>%
    group_by(id) %>%
    arrange(e_date) %>%
    slice(1L)

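If we need a base R option, ave can be used to rank the dates within each 'id'. Comparing the rank with 1 gives a logical index, and ties.method = "first" guarantees exactly one row per 'id' even when the earliest date is repeated:

data_full[with(data_full, ave(as.numeric(e_date), id, FUN = function(x) rank(x, ties.method = "first")) == 1), ]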
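Another base R sketch, closer to the approach in the question: compute the minimum per id with aggregate, then join the result back to the full data with merge instead of pasting keys together. One likely problem with the question's last step is that new$comb %in% data_full$comb tests the rows of new (all of which match), so the resulting index does not pick the matching rows of data_full. The sample data and the value column below are assumptions for illustration.

```r
# Hypothetical sample data; the 'value' column stands in for any other columns.
data_full <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  e_date = as.Date(c("2016-05-01", "2015-12-11", "2016-08-25",
                     "2015-07-08", "2014-03-01")),
  value  = c("a", "b", "c", "d", "e")
)

# Step 1: minimum e_date per id (one row per id).
earliest <- aggregate(e_date ~ id, data = data_full, FUN = min)

# Step 2: join back on both id and e_date to recover the remaining columns.
data_fixed <- merge(data_full, earliest, by = c("id", "e_date"))
print(data_fixed)
```

If an id has several rows on its earliest date, merge keeps all of them, so an extra de-duplication step would be needed to force one row per id.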

Solution 2:[2]

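Another answer that uses dplyr's filter function. Note that if an id has several rows sharing the earliest date, all of them are kept:

library(dplyr)
data_full %>%
  group_by(id) %>%
  filter(e_date == min(e_date))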
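To illustrate how ties behave, here is a small sketch comparing filter with slice_min, which is available in dplyr 1.0.0 and later. The sample data is made up, and id 1 deliberately has two rows on its earliest date.

```r
library(dplyr)

# Hypothetical sample data: id 1 has two rows on its earliest date.
data_full <- data.frame(
  id     = c(1, 1, 1, 2),
  e_date = as.Date(c("2015-12-11", "2015-12-11", "2016-05-01", "2015-07-08"))
)

# filter() keeps every row that equals the group minimum, including ties.
with_ties <- data_full %>%
  group_by(id) %>%
  filter(e_date == min(e_date)) %>%
  ungroup()

# slice_min() with with_ties = FALSE keeps exactly one row per id.
one_per_id <- data_full %>%
  group_by(id) %>%
  slice_min(e_date, n = 1, with_ties = FALSE) %>%
  ungroup()

print(nrow(with_ties))
print(nrow(one_per_id))
```

Which behaviour is preferable depends on whether duplicated earliest dates carry different information in the other columns.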

Solution 3:[3]

You may use library(sqldf) to get the minimum date as follows:

data1<-data.frame(id=c("789","123","456","123","123","456","789"),
                  e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))  

library(sqldf)
data2 = sqldf("SELECT id,
                    min(e_date) as 'earliest_date'
                    FROM data1 GROUP BY 1", method = "name__class")    

head(data2)   

id   earliest_date   
123    2014-03-01      
456    2015-07-08   
789    2015-12-11  
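The query above compares e_date as text. That gives the right answer here because the dates are written as year-month-day, where alphabetical order matches chronological order. A small sketch of why the stored format matters (the day/month/year values are hypothetical):

```r
# ISO-style strings: the text minimum is also the earliest date.
iso <- c("2016-05-01", "2015-12-11", "2014-03-01")
text_min_iso <- min(iso)

# Hypothetical day/month/year strings: the text minimum is NOT the earliest date.
dmy <- c("01/05/2016", "11/12/2015", "25/03/2014")
text_min_dmy <- min(dmy)

# Converting to Date first gives the true earliest date.
date_min_dmy <- min(as.Date(dmy, format = "%d/%m/%Y"))

print(text_min_iso)
print(text_min_dmy)
print(date_min_dmy)
```

So when dates are stored in some other text or integer layout, converting them to Date before taking the minimum is the safer route.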

Solution 4:[4]

I made a reproducible example, supposing that you grouped some dates by which quarter they were in.

library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)

df %>%
  group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])

# A tibble: 4 x 2
  which_quarter sort(rand_weeks)[1]
          <dbl>              <time>
1             1 2017-01-05 05:46:32
2             2 2017-04-06 05:46:32
3             3 2016-08-18 05:46:32
4             4 2016-10-06 05:46:32

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 akrun
Solution 2 greg_s
Solution 3 AndrewGB
Solution 4 shayaa