Skip Navigation

Your Environment. Your Health.



RNA-sequencing (RNA-seq) provides genome-wide representation of gene expression. RNA-Seq data is count-based rendering many normal distribution models for analysis inappropriate. Extracting patterns and identifying co-expressed genes (EPIG) for microarray data was adapted for RNA-seq (EPIG-Seq). To identify patterns, count-based correlation measures similarity between expression profiles, quasi-Poisson modelling estimates dispersion and a location parameter indicates the magnitude of differential expression. EPIG-Seq then categorizes genes to the patterns that they correlation with.

Workflow for EPIG-Seq

Workflow for EPIG-Seq

EPIG-Seq Implementation

Step 1: Pattern extraction


A compiled RNA-seq gene expression dataset consists of a 2-dimensional matrix, in which each row represents a gene expression profile and each column represents a sample. Denote xij as the count of reads from sample j mapped to a gene i and xkj as the count of reads from sample j mapped to a gene k. To measure the count level correlation between two gene profiles, the similarity measure for count data previously defined as

Formula for epig-seq

was adopted where

Formula for epig-seq

and a is the total number of samples with read counts mapped to either profile.

Magnitude of change

The strength of a gene expression profile’s signal is defined according to the value of the test-statistic location parameter obtained from a Wilcoxon rank sum non-parametric test measuring the difference between the ranks of the expression of the genes in sample X vs those in sample Y. Here, sample X is the biological replicates from the treated, perturbed or diseased group and sample Y is the biological replicates from the controls. The gth gene expression profile’s signal is therefore:

Formula for epig-seq

When the sample size for each group is small (i.e., ≤ 30), the estimation of the Z-statistic from the Wilcoxon rank sum test can be spurious. In such a case, the strength of the gth gene’s differential expression is the value of the Hodges-Lehmann location parameter estimator Formula for epig-seq for the difference between two groups. Briefly, Formula for epig-seq is the median of the Walsh averages; i.e., the average of all possible pairs of differences between the ranks of the gth gene RNA-Seq counts in sample X vs the ranks of the RNA-Seq counts in sample Y. Thus, for a set of N paired differences observations, there will be N(N+1)/2 Walsh averages. The median of all the Walsh averages is equal to Formula for epig-seq. Hence,

Formula for epig-seq


Count data is known to be dispersed. The variance-to-mean ratio (VMR) is a measure of dispersion ( Formula for epig-seq) and is the inverse of signal to noise. If Formula for epig-seq is larger than 1, a data set is said to be overdispersed.  For each gth gene expression profile, Formula for epig-seq is estimated using a quasi-Poisson regression to model the data. For Poisson distributed data, the variance is equal to the mean, i.e., V(Yg) = E(Yg) = Formula for epig-seq. The quasi-Poisson likelihood model is commonly used for overdispersed count data as it incorporates the dispersion parameter Formula for epig-seq into the Poisson model. In doing so, the variance of the response (Yg) is a linear function of the mean, V(Yg) = Formula for epig-seq and dispersion estimated as

Formula for epig-seq

where n is the sample size, c is the number of estimated parameters and Formula for epig-seq is an inverse function of the linear predictors. Here, the inverse function is a “log” link in the form of a generalized linear model:

Formula for epig-seq, where for the gth gene expression profile in the jth sample, Y is the read count, X is the independent variable and Formula for epig-seq is the random error term

Step 2: Categorization of gene expression profiles to patterns

Once the patterns have been extracted, the  Formula for epig-seq measure is used to correlate the ith profile to the kth pattern. The profile is assigned to the pattern to which it has the highest similarity to. Once all the profiles are assigned, a representative profile for each of the patterns is determined by the highest median correlation to the other profiles in the pattern. Briefly, for the ith gene expression profile and for the kth pattern it is assigned to, a Pattern Correlation Score

Formula for epig-seq

is computed as the median of the correlations among the ith profile (xi) to all other profiles (xj) assigned to pattern k (Pk). Until no more profiles are reassigned, the Formula for epig-seq measure is used to correlate the ith profile to the kth pattern and assign it to the pattern with which it has the highest correlation. Since the CYs does not denote the directionality of the correlation, the assignment of a profile is done if and only if, corr(ind(Zi),ind(Zj)) for all groups = 1. Here corr() is the Pearson correlation and ind() is the indication of + or – of the location parameter Z.

Reference for Citing

Li J, Bushel PR. EPIG-Seq: extracting patterns and identifying co-expressed genes from RNA-Seq data. BMC Genomics. 2016 Mar 22;17:255. [Abstract]


R version ≥ 3.1.2
CRAN R package stats version 3.1.2 to fit a generalized linear model (glm)

Matlab installation executable (MGLInstaller.exe) to install Matlab math libraries
epigseqgui.exe executable

EPIG-Seq is available at:

Report bugs, corrections and suggestions to Pierre Bushel at and Jianying Li

Public Domain Notice
This is U.S. government work. Under 17 U.S.C. 105 no copyright is claimed and it may be freely distributed and copied.