Using for loop in R to combine column values based on preceding column conditions
I am working with a big dataset with multiple observations per gene, on varying dates and with varying expression levels.
I would like to sum all the 'expression' column values if:
- They belong to the same gene (column 'gene' i = column 'gene' i+1)
AND
- They are measured on the same date (column 'date' i = column 'date' i+1)
The output should have one observation per gene per date, i.e. the sum of all the expression levels of that gene on that date.
I have tried writing a for loop, but I am relatively new to R and am having trouble creating a data frame out of the loop. An alternative solution might be better.
Thanks a lot!
Solution 1:[1]
How big is "big"? If you really have a large dataset, you are much better off with data.table.
Here is an example with 10 million rows.
# made up example: YOU should provide this
#
set.seed(1) # for reproducible example
df <- data.frame(gene=sample(1:1e6, 1e7, replace=TRUE),
                 expression=rpois(1e7, 5),
                 date=sample(43000:44000, 1e7, replace=TRUE))
##
#
library(tictoc) # for timing functions
library(dplyr)
library(data.table)
##
#
tic()
result.1 <- df %>% group_by(gene, date) %>% summarise(expression = sum(expression))
toc()
## 40.83 sec elapsed
##
#
tic()
result.2 <- setDT(df)[, .(expression=sum(expression)), keyby=.(gene, date)]
toc()
## 3.03 sec elapsed
So data.table is about 13 times faster in this example.
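To check what the grouped sum produces before running it on the full data, it may help to try it on a few rows first. The sketch below uses base R's aggregate() rather than dplyr or data.table, so it needs no packages. The toy data frame and its values are made up for illustration; only the column names (gene, date, expression) come from the question.

```r
# Hypothetical toy data: column names follow the question, values are invented.
toy <- data.frame(
  gene       = c("A", "A", "A", "B", "B"),
  date       = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                         "2020-01-01", "2020-01-01")),
  expression = c(1, 2, 5, 3, 4)
)

# Base R: sum expression within each gene/date combination.
toy_sums <- aggregate(expression ~ gene + date, data = toy, FUN = sum)

# Sort by gene, then date, which mirrors the ordering that keyby gives in data.table.
toy_sums <- toy_sums[order(toy_sums$gene, toy_sums$date), ]
rownames(toy_sums) <- NULL
print(toy_sums)
```

This should give one row per gene and date, with the duplicated gene/date rows collapsed into their sum. For a dataset of the size timed above, aggregate() is likely to be slower than either approach shown, so it is better suited to checking results than to production use.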
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | jlhoward |