Extracting Patterns and Identifying coexpressed Genes (EPIG)
Overview
EPIG is a method for Extracting microarray gene expression Patterns and Identifying coexpressed Genes. Through evaluation of the correlations among profiles, the magnitude of variation in gene expression profiles, and profile signal-to-noise ratios, EPIG extracts a set of patterns representing coexpressed genes without a predefined seeding of the patterns.
Method
A compiled microarray gene expression data set (conventionally presented as log_{2} pixel intensity ratio values) consists of a two-dimensional matrix, in which each row represents a gene expression profile and each column represents an array. Upon sample perturbation or variation in biological factors, such as agent, dose, time or tissue, a gene expression profile can be made up of intergroup and intragroup samples. The arrays in an intragroup sample have a factor in common, e.g. biological replicates. The arrays in intergroup samples possess different factors, e.g. sham treatment and time points post-UV or post-IR treatment. Each datum of log_{2} ratio is denoted as g_{ij} in a gene expression profile, where i is the intergroup index from 1 to m, j is the intragroup index from 1 to n_{i}, m is the number of intergroups and n_{i} is the number of arrays in the i^{th} intergroup. To evaluate such a profile, each intragroup average \bar{g}_{i} and sample variance s_{i}^{2} are computed. A gene expression profile’s signal is defined as
S = \max_{i} |\bar{g}_{i}|    (1)
where 1 ≤ i ≤ m. A profile’s noise estimate is defined as the square root of the pooled variance, i.e.
N = \sqrt{ \frac{\sum_{i=1}^{m} (n_{i} - 1) s_{i}^{2}}{\sum_{i=1}^{m} (n_{i} - 1)} }    (2)
where the sample variance
s_{i}^{2} = \frac{1}{n_{i} - 1} \sum_{j=1}^{n_{i}} (g_{ij} - \bar{g}_{i})^{2}
From Equations 1 and 2, a profile’s signal-to-noise ratio is defined as
SNR = \frac{S}{N}    (3)
When m = 1, Equation 3 is equivalent to a two-sample t-test, since by default the log_{2} pixel intensity ratio is that of the treated sample against its control. Equation 3 also covers the case m > 1, i.e. multiple intergroups.
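As a rough illustration of the quantities described above, the following Python sketch computes a profile's signal, noise and signal-to-noise ratio from its intragroup samples. It assumes that the signal is the largest absolute intragroup mean and that the SNR is the plain ratio of signal to noise. The function name profile_snr and the nested-list input layout are hypothetical and are not part of the EPIG software.

```python
import math
from statistics import mean, variance


def profile_snr(groups):
    """Return (signal, noise, snr) for one gene expression profile.

    groups[i][j] holds the log2 ratio g_ij: one inner list per intergroup,
    one value per array (replicate) within that intergroup.
    """
    group_means = [mean(g) for g in groups]

    # Signal: largest absolute intragroup average (assumed form of Equation 1).
    signal = max(abs(m) for m in group_means)

    # Noise: square root of the pooled variance. Groups with a single array
    # carry no variance information and contribute zero degrees of freedom.
    dof = sum(len(g) - 1 for g in groups)
    if dof == 0:
        raise ValueError("at least one intergroup needs two or more arrays")
    pooled = sum((len(g) - 1) * variance(g) for g in groups if len(g) > 1) / dof
    noise = math.sqrt(pooled)

    # SNR: plain ratio of signal to noise (assumed form of Equation 3).
    snr = signal / noise if noise > 0 else math.inf
    return signal, noise, snr


if __name__ == "__main__":
    # Two intergroups (e.g. sham and one time point), three replicates each.
    example = [[0.0, 0.2, -0.2], [1.0, 1.2, 0.8]]
    print(profile_snr(example))
```

With thresholds such as SNR ≥ 3 and S ≥ 0.5, a profile like the one in the example would be retained, since its replicates agree closely relative to the size of the change.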
In extracting gene expression patterns, EPIG uses a filtering process in which all profiles are initially considered pattern candidates. The following is the pseudocode for the algorithm:
Briefly, using all pairwise correlations, a candidate profile is removed from pattern construction consideration if its local cluster size M is less than a predefined size M_{t}, or if it is highly correlated (r > R_{t}) with another profile that has a larger local cluster size. Among the remaining profiles, EPIG then creates representative profiles for the corresponding local clusters and removes those profiles with an SNR (Equation 3) of less than 3 or a magnitude S (Equation 1) of less than 0.5. After this filtering, the remaining profiles constitute the extracted patterns, which serve as the representatives of their local clusters.
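The correlation-based part of this filter can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the EPIG implementation: a profile's local cluster is taken to be the set of profiles (including itself) whose Pearson correlation with it exceeds R_{t}, ties in cluster size are broken by keeping the profile with the lower index, and the later steps (building representative profiles and filtering on SNR and magnitude) are omitted. The names pearson and extract_patterns and the default thresholds are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def extract_patterns(profiles, r_t=0.8, m_t=3):
    """Return indices of profiles that survive the correlation-based filter."""
    n = len(profiles)
    r = [[pearson(profiles[a], profiles[b]) for b in range(n)] for a in range(n)]

    # Local cluster size M: the profile itself plus every profile with r > r_t
    # (counting the profile itself is an assumption of this sketch).
    size = [sum(1 for b in range(n) if b == a or r[a][b] > r_t) for a in range(n)]

    patterns = []
    for a in range(n):
        if size[a] < m_t:
            continue  # local cluster too small
        # Drop a if a highly correlated profile has a larger local cluster;
        # equal sizes are resolved in favour of the lower index (assumption).
        dominated = any(
            b != a
            and r[a][b] > r_t
            and (size[b] > size[a] or (size[b] == size[a] and b < a))
            for b in range(n)
        )
        if not dominated:
            patterns.append(a)
    return patterns


if __name__ == "__main__":
    profiles = [
        [0, 1, 2, 3, 2, 1], [0, 1.1, 2, 2.9, 2, 1], [0.1, 1, 2.1, 3, 1.9, 1],
        [3, 2, 1, 0, 1, 2], [3, 2.1, 1, 0.1, 1, 2], [2.9, 2, 1, 0, 1.1, 2],
        [1, -1, 1, -1, 1, -1],
    ]
    print(extract_patterns(profiles))
```

In the example, the first three profiles and the next three form two tight local clusters, so one profile from each is kept as a candidate pattern, while the last profile has no close neighbours and is dropped.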
Each of the patterns has the highest local cluster size in comparison with other highly similar profiles (e.g. correlation larger than 0.8) in the same local cluster. Subsequently, EPIG assigns each gene to the pattern with which its profile has the highest correlation. A gene is not assigned to any extracted pattern, and is considered an “orphan”, if its highest correlation r-value is lower than a given threshold R_{c}. Typically R_{c} is set to a value that corresponds to a correlation p-value of 10^{-4} to ensure the significance of the coexpression.
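The assignment step and the choice of R_{c} can be illustrated with the short Python sketch below. It is an approximation under stated assumptions rather than EPIG's own procedure: the threshold is derived from a two-sided p-value using the Fisher z-transformation (a normal approximation, valid only for more than three arrays), and the names pearson, r_threshold and assign_genes are hypothetical.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def r_threshold(p_value, n_arrays):
    """Approximate R_c for a two-sided correlation p-value (Fisher z, assumption)."""
    if n_arrays <= 3:
        raise ValueError("the Fisher z approximation needs more than 3 arrays")
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return math.tanh(z / math.sqrt(n_arrays - 3))


def assign_genes(gene_profiles, patterns, r_c):
    """Map each gene name to the index of its best pattern, or None for an orphan."""
    assignment = {}
    for name, profile in gene_profiles.items():
        scores = [pearson(profile, p) for p in patterns]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        # A gene whose best correlation falls below r_c is left unassigned.
        if best is None or scores[best] < r_c:
            assignment[name] = None
        else:
            assignment[name] = best
    return assignment


if __name__ == "__main__":
    patterns = [[0, 1, 2, 3, 2, 1], [3, 2, 1, 0, 1, 2]]
    genes = {
        "a": [0, 1.1, 2, 2.9, 2, 1],
        "b": [2.9, 2, 1, 0, 1.1, 2],
        "c": [1, -1, 1, -1, 1, -1],
    }
    print(assign_genes(genes, patterns, r_c=0.9))
```

Because the threshold depends on the number of arrays, the same p-value corresponds to a lower R_{c} for data sets with more arrays.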
Requirements
The software has been tested on PCs running the Windows 2000 and Windows XP operating systems and requires Java Runtime Environment (JRE) version 1.4.2.
Downloads
 Download the EPIG archive (1 MB)
 Download the Quick Start Guide (498 KB)
 Report bugs, corrections and suggestions to bushel@niehs.nih.gov
Public Domain Notice
This is U.S. government work. Under 17 U.S.C. 105 no copyright is claimed and it may be freely distributed and copied.
Contact

Pierre R. Bushel, Ph.D.
Staff Scientist 
Tel (919) 316-4564
Fax (919) 541-4311
bushel@niehs.nih.gov