Using a for loop in R to combine column values based on preceding column conditions

I am working with a big dataset that has multiple observations per gene, taken on varying dates and with varying expression levels. Data used

I would like to sum all the 'expression' column values if:

  1. They belong to the same gene (column 'gene' i = column 'gene' i+1)

AND

  2. They are measured on the same date (column 'date' i = column 'date' i+1)

The output should be something like this (each gene should have 1 observation per date, i.e. the sum of all the expression levels of that gene on that date): The_desired_output

I have tried writing a for loop, but I am relatively new to R and am having trouble creating a data frame from the loop. An alternative solution might be better.

Thanks a lot!



Solution 1:[1]

How big is "big"? If you really have a large dataset, you are much better off with data.table.

Here is an example with 10 million rows.

# Made-up example data: YOU should provide this
set.seed(1)    # for a reproducible example
df <- data.frame(gene=sample(1:1e6, 1e7, replace=TRUE),
                 expression=rpois(1e7, 5),
                 date=sample(43000:44000, 1e7, replace=TRUE))

library(tictoc)       # tic()/toc() for timing
library(dplyr)
library(data.table)

# dplyr: group by gene and date, then sum expression within each group
tic()
result.1 <- df %>% group_by(gene, date) %>% summarise(expression = sum(expression))
toc()
## 40.83 sec elapsed

# data.table: same aggregation; setDT() converts df to a data.table by reference
tic()
result.2 <- setDT(df)[, .(expression=sum(expression)), keyby=.(gene, date)]
toc()
## 3.03 sec elapsed

So data.table is about 13 times faster than dplyr in this example (3.03 vs. 40.83 seconds).
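
If the dataset is small, base R can do the same grouped sum with no extra packages. The following is only a minimal sketch: the toy values are made up, and the column names gene, date and expression are assumed to match the question.

```r
# Toy data: made-up values, column names assumed from the question
toy <- data.frame(
  gene       = c("A", "A", "A", "B", "B"),
  date       = c("2020-01-01", "2020-01-01", "2020-01-02",
                 "2020-01-01", "2020-01-01"),
  expression = c(1, 2, 5, 10, 20)
)

# Sum expression within each gene/date combination
toy_sum <- aggregate(expression ~ gene + date, data = toy, FUN = sum)

# Sort by gene, then date, for easier reading
toy_sum <- toy_sum[order(toy_sum$gene, toy_sum$date), ]
print(toy_sum)
```

The result has one row per gene/date pair, so no explicit loop or row-by-row comparison is needed. On large data the grouped approaches timed above are likely to be the better choice.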

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jlhoward