RNA sequencing (RNA-seq) provides a genome-wide representation of gene expression. Because RNA-seq data are count-based, many analysis models that assume a normal distribution are inappropriate. The extracting patterns and identifying co-expressed genes (EPIG) method, originally developed for microarray data, was adapted for RNA-seq as EPIG-Seq. To identify patterns, a count-based correlation measures the similarity between expression profiles, quasi-Poisson modelling estimates dispersion, and a location parameter indicates the magnitude of differential expression. EPIG-Seq then categorizes genes to the patterns with which they correlate.
Workflow for EPIG-Seq
Step 1: Pattern extraction
A compiled RNA-seq gene expression dataset consists of a 2-dimensional matrix in which each row represents a gene expression profile and each column represents a sample. Denote xij as the count of reads from sample j mapped to gene i, and xkj as the count of reads from sample j mapped to gene k. To measure the count-level correlation between two gene profiles, the similarity measure for count data previously defined as
was adopted where
and a is the total number of samples with read counts mapped to either profile.
Magnitude of change
The strength of a gene expression profile’s signal is defined as the value of the test-statistic location parameter obtained from a non-parametric Wilcoxon rank sum test, which measures the difference between the ranks of the gene's expression in sample X and those in sample Y. Here, sample X comprises the biological replicates from the treated, perturbed or diseased group, and sample Y comprises the biological replicates from the controls. The gth gene expression profile’s signal is therefore:
When the sample size for each group is small (i.e., ≤ 30), the estimation of the Z-statistic from the Wilcoxon rank sum test can be spurious. In such a case, the strength of the gth gene’s differential expression is the value of the Hodges-Lehmann location parameter estimator
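The two signal measures described above can be sketched in plain Python. This is a minimal illustration, not the EPIG-Seq implementation: it assumes the textbook normal approximation for the rank sum Z-statistic (without tie or continuity corrections) and the standard two-sample Hodges-Lehmann estimator (the median of all pairwise differences). The function names and the read counts are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import median


def midranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # average of the 1-based ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


def rank_sum_z(x, y):
    """Normal-approximation Z for the Wilcoxon rank sum test (assumed textbook form)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])                    # rank sum of group X
    mean_w = n1 * (n1 + n2 + 1) / 2        # expected rank sum under the null
    var_w = n1 * n2 * (n1 + n2 + 1) / 12   # null variance, ignoring ties
    return (w - mean_w) / sqrt(var_w)


def hodges_lehmann(x, y):
    """Two-sample Hodges-Lehmann estimate: median of all differences x_i - y_j."""
    return median(xi - yj for xi, yj in product(x, y))


treated = [30, 42, 51]  # made-up read counts for one gene, group X
control = [10, 12, 20]  # made-up read counts for the same gene, group Y
z = rank_sum_z(treated, control)
shift = hodges_lehmann(treated, control)
```

With only three replicates per group, the Z value rests on a coarse normal approximation, which is why the shift estimate is the more natural summary at this sample size.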
Count data is known to be dispersed. The variance-to-mean ratio (VMR) is a measure of dispersion (
where n is the sample size, c is the number of estimated parameters and
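As a rough illustration of the dispersion quantities mentioned here, the sketch below computes the variance-to-mean ratio and a Pearson-based dispersion estimate of the kind commonly reported for quasi-Poisson fits, that is, the Pearson chi-square divided by n - c. Whether this matches the exact estimator used in EPIG-Seq is an assumption; the function names and counts are hypothetical, and the fitted means come from a trivial intercept-only model rather than a real glm fit.

```python
from statistics import mean, variance


def variance_to_mean_ratio(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson-like counts."""
    return variance(counts) / mean(counts)


def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-square divided by the residual degrees of freedom (n - c)."""
    n = len(counts)
    chi_sq = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted))
    return chi_sq / (n - n_params)


counts = [4, 9, 12, 15]   # made-up read counts for one gene
mu = mean(counts)         # intercept-only fit: every fitted value equals the mean
vmr = variance_to_mean_ratio(counts)
phi = pearson_dispersion(counts, [mu] * len(counts), n_params=1)
```

For an intercept-only model the two quantities coincide (both are 2.2 for these counts); they differ once the fitted means vary across samples, for example when group is a covariate.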
Step 2: Categorization of gene expression profiles to patterns
Once the patterns have been extracted, the
is computed as the median of the correlations among the ith profile (xi) to all other profiles (xj) assigned to pattern k (Pk). Until no more profiles are reassigned, the
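The categorization loop can be sketched as follows. This is a hedged illustration under several assumptions: Pearson correlation stands in for the count-based similarity measure of Step 1, each profile is moved to the pattern with the highest median score, and updates are applied in place as profiles are visited. All names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import median


def pearson(a, b):
    """Stand-in similarity (assumption); EPIG-Seq uses a count-based measure."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm if norm else 0.0


def median_score(i, members, profiles, sim):
    """Median similarity of profile i to the other profiles assigned to a pattern."""
    others = [j for j in members if j != i]
    if not others:
        return float("-inf")  # a profile alone in its pattern has no score
    return median(sim(profiles[i], profiles[j]) for j in others)


def categorize(profiles, assignment, sim=pearson, max_iter=100):
    """Reassign profiles to their best-scoring pattern until no profile moves."""
    assignment = list(assignment)
    for _ in range(max_iter):
        moved = False
        for i in range(len(profiles)):
            patterns = sorted(set(assignment))
            scores = {
                k: median_score(
                    i, [j for j, p in enumerate(assignment) if p == k], profiles, sim
                )
                for k in patterns
            }
            best = max(patterns, key=lambda k: scores[k])
            if scores[best] > scores[assignment[i]]:
                assignment[i] = best
                moved = True
        if not moved:
            break
    return assignment


profiles = [
    [1, 2, 3, 4],    # increasing
    [2, 4, 6, 9],    # increasing
    [9, 6, 4, 1],    # decreasing
    [8, 5, 3, 2],    # decreasing
    [3, 5, 8, 12],   # increasing, deliberately started in the wrong pattern
]
initial = [0, 0, 1, 1, 1]
result = categorize(profiles, initial)
```

In this toy run the last profile moves from pattern 1 to pattern 0 on the first pass, and the second pass makes no changes, so the loop stops.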
Reference for Citing
Li J, Bushel PR. EPIG-Seq: extracting patterns and identifying co-expressed genes from RNA-Seq data. BMC Genomics. 2016 Mar 22;17:255.
R version ≥ 3.1.2
CRAN R package stats version 3.1.2 to fit a generalized linear model (glm)
Matlab installation executable (MGLInstaller.exe) to install Matlab math libraries
EPIG-Seq is available at: http://sourceforge.net/projects/epig-seq/
Public Domain Notice
This is U.S. government work. Under 17 U.S.C. 105 no copyright is claimed and it may be freely distributed and copied.