How to select highly variable genes in bulk RNA-seq data?
As a pre-processing step, I need to select the top 1000 highly variable genes (rows) from a bulk RNA-seq dataset that contains about 60k genes across 100 different samples (columns). Each column value is already the mean of the triplicates. The table contains normalized values in FPKM. (Note: I don't have access to raw counts, so I am not able to use the common R packages, as these packages take raw counts as input.) In this case, what is the best way to select the top 1000 variable genes?
I have tried filtering the genes with the rowSums() function (removing the genes with low row sums), which narrowed the set down from 60k genes to 10k genes, but I am not sure if it is the right way to select highly variable genes. Any input is appreciated.
Solution 1:[1]
Filtering by row sum is the first filtration step. After this, your data will be filtered further by a log2 fold-change cutoff and an adjusted p-value (padj) threshold (0.05 or 0.01, depending on your goal). You can repeat this procedure with different row-sum cutoffs and compare the results. Personally, I discard genes whose row sum is zero.
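For illustration, here is a minimal sketch of the row-sum filter followed by a simple variability ranking. The ranking step is an assumption on my part, not something stated above: it scores each gene by the variance of log2(FPKM + 1) across samples, which is one common way to pick variable genes when only normalized values are available. The function name `top_variable_genes`, the `min_row_sum` parameter, and the toy values are all hypothetical. It does not cover the fold-change or adjusted p-value steps, which need a comparison between sample groups.

```python
import math
import statistics


def top_variable_genes(fpkm, n_top=1000, min_row_sum=0.0):
    """Return gene IDs ranked by variance of log2(FPKM + 1), highest first.

    fpkm: dict mapping gene ID -> list of FPKM values (one per sample).
    min_row_sum: genes whose FPKM row sum is <= this value are dropped first.
    """
    scored = []
    for gene, values in fpkm.items():
        # Step 1: row-sum filter (drops all-zero genes when min_row_sum is 0).
        if sum(values) <= min_row_sum:
            continue
        # Step 2: log-transform so highly expressed genes do not dominate.
        logged = [math.log2(v + 1.0) for v in values]
        # Step 3: sample variance across samples is the variability score.
        scored.append((statistics.variance(logged), gene))
    # Sort by variance, descending; ties are broken by gene ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in scored[:n_top]]


# Toy data, made up for illustration: 4 genes x 4 samples.
toy = {
    "geneA": [0.0, 0.0, 0.0, 0.0],   # all zero: removed by the row-sum filter
    "geneB": [5.0, 5.0, 5.0, 5.0],   # expressed but constant: variance 0
    "geneC": [1.0, 3.0, 7.0, 15.0],  # log2(x + 1) = 1, 2, 3, 4
    "geneD": [3.0, 3.0, 7.0, 7.0],   # log2(x + 1) = 2, 2, 3, 3
}
print(top_variable_genes(toy, n_top=2))  # ['geneC', 'geneD']
```

For the real table you would call it with `n_top=1000` and try a few `min_row_sum` values to see how much the selected set changes.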
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Mahmood Tavakoli |