ORIOGEN (Order Restricted Inference for Ordered Gene ExpressioN) and multidimensional pairwise comparisons
Developed by:
Shyamal D. Peddada
Biostatistics Branch
National Institute of Environmental Health Sciences
National Institutes of Health
peddada@niehs.nih.gov
Programmed by:
John Zajd and Shawn Harris
SRA International, Inc.
shawn_harris@sra.com
ORIOGEN Version 3.04.01 Release Description
ORIOGEN is a user-friendly Java-based software
package for selecting and clustering genes according to their time-course or
dose-response profiles.
It can be used for the following purposes:
1. Comparison of high dimensional data (e.g. gene expression) among
two or more ordered experimental conditions, such as in dose-response studies or time-course
experiments. This software can be used when the data are independent among experimental
conditions or when they are dependent, as in repeated measurement designs. The underlying methodology is described in Peddada
et al. (2003, 2005, 2010).
Since the methodology is based on bootstrapping the residuals, this software may not be suitable if the sample
size per group is very small (e.g. 3), especially when the data between groups
are correlated. For computational efficiency, this methodology uses
the adaptive bootstrap described in Guo and Peddada (2008). This methodology attempts to control the false
discovery rate (FDR).
2. Pairwise comparisons of high dimensional data (e.g. gene expression) among
two or more experimental conditions. Pairs to be compared are chosen a priori by the
user. This software can be used when the data are
independent among experimental conditions.
The methodology not only controls the overall
false discovery rate for making all desired pairwise comparisons, but also
controls the error committed in the direction of inequality between groups for each differentially
expressed variable (e.g. gene). Thus the methodology controls the mixed
directional FDR (mdFDR). This software
is based on the methodology described in Guo, Sarkar and Peddada (2010).
It is based on the methodology developed in Peddada
et al. (2003) and refined in Guo and Peddada (2008) and Peddada et al. (2010).
Version 3.0 of ORIOGEN has the following advantages over its predecessor
versions:
The user pre-specifies a list of profiles (or
patterns) of mean gene expression over time/dose that may be of interest for a
specific experiment. The present version of ORIOGEN can detect increasing,
decreasing, umbrella-shaped or inverted-umbrella-shaped patterns and cyclic
patterns (up to one cycle only).
Here the word mean refers to the population
mean (which is unknown) and not the sample mean, which is calculated from given
data and provides an estimate of the population mean. Thus the profiles are
described in terms of the population means. Note that, since the sample mean is a
random realization from a population, the observed sample mean expressions over
time/dose may not conform exactly to a pattern of mean expression satisfied by
the population means. For example, the experimenter may be interested in
selecting a gene whose mean expression increases with time/dose (known as increasing
shape). However, due to the randomness in the data, the observed sample
means may not necessarily have an increasing profile. Similarly, an
experimenter may be interested in selecting genes where the mean expression
increases with time/dose up to a certain point and then decreases (i.e. umbrella
shape). Due to randomness in the data, the sample means may not necessarily
follow this pattern.
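The distinction between a profile of population means and the observed sample means can be illustrated with a small sketch. The class and method names below (ProfileShapes, isIncreasing, isUmbrella) are hypothetical and are not part of ORIOGEN; the code only checks whether a vector of means satisfies an increasing or umbrella-shaped ordering, and the numbers are made up for illustration.

```java
import java.util.Arrays;

/** Illustrative only: checks whether a vector of means satisfies a simple order restriction. */
final class ProfileShapes {

    /** True if means[0] <= means[1] <= ... <= means[n-1]. */
    static boolean isIncreasing(double[] means) {
        for (int i = 1; i < means.length; i++) {
            if (means[i] < means[i - 1]) {
                return false;
            }
        }
        return true;
    }

    /** True if means[0] >= means[1] >= ... >= means[n-1]. */
    static boolean isDecreasing(double[] means) {
        for (int i = 1; i < means.length; i++) {
            if (means[i] > means[i - 1]) {
                return false;
            }
        }
        return true;
    }

    /** True if the means rise (weakly) up to index peak and fall (weakly) after it. */
    static boolean isUmbrella(double[] means, int peak) {
        double[] rising = Arrays.copyOfRange(means, 0, peak + 1);
        double[] falling = Arrays.copyOfRange(means, peak, means.length);
        return isIncreasing(rising) && isDecreasing(falling);
    }

    public static void main(String[] args) {
        // Hypothetical population means that follow an increasing profile.
        double[] populationMeans = {1.0, 2.0, 3.0, 4.0};
        // Hypothetical sample means from the same gene: noise breaks the ordering.
        double[] sampleMeans = {1.1, 2.3, 2.2, 4.1};
        System.out.println(isIncreasing(populationMeans)); // prints "true"
        System.out.println(isIncreasing(sampleMeans));     // prints "false"
    }
}
```

A gene whose sample means fail such a check may still have population means that satisfy the profile, which is why selection relies on a statistical test rather than on the raw ordering of the sample means.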
ORIOGEN does not normalize the data, so it is
recommended that the user pre-process the data by applying a suitable
normalization method before submitting the data to ORIOGEN. ORIOGEN selects
genes, based on a statistical decision rule with a pre-specified level of
significance, and clusters each selected gene into an appropriate
"best-fitting" pattern or profile. The methodology can be briefly
described as follows.
ORIOGEN expresses each pre-specified profile in
terms of mathematical inequalities (known as order restrictions) between the
mean expressions. Then using the methodology developed in Hwang and Peddada
(1994) it fits each pre-specified profile to each gene. Thus for a given gene,
ORIOGEN computes a "goodness-of-fit" statistic for each candidate
profile. For a gene g, it then tests for significance using a minor
modification to the statistic obtained in Step 3 of Peddada et al. (2003). This
modification replaces Step 7 described in Peddada et al. (2003). As in the
well-known SAM methodology, we include a "fudge factor" s0.
This factor is calculated as the nth percentile of the SSE values for all of
the genes, where n is supplied by the user. Typical values of n
are 10% for repeated measures data and 0% (disabled) for ordinary data. The
modified test statistic is
[equation missing from the source text]
where, for gene g, s_g
is the pooled sample standard deviation for all time points/dose groups, and n_j
and n_k are the numbers of replicates at the
endpoints of the region.
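The percentile-based choice of the fudge factor can be sketched as follows. This is a minimal illustration, not ORIOGEN's code: the class name FudgeFactor, the method fromPercentile, the nearest-rank percentile rule, and the convention that a percentile of 0 yields s0 = 0 (matching "disabled") are all assumptions made for the example.

```java
import java.util.Arrays;

/** Illustrative only: derives a fudge factor s0 as a percentile of per-gene SSE values. */
final class FudgeFactor {

    /**
     * Returns the requested percentile of the SSE values using the nearest-rank rule.
     * Assumption: a percentile of 0 (or less) disables the fudge factor, so 0.0 is returned.
     */
    static double fromPercentile(double[] sse, double percentile) {
        if (percentile <= 0.0 || sse.length == 0) {
            return 0.0;
        }
        double[] sorted = sse.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        int index = Math.min(Math.max(rank - 1, 0), sorted.length - 1);
        return sorted[index];
    }

    public static void main(String[] args) {
        // Hypothetical SSE values, one per gene.
        double[] sse = {4.0, 1.0, 3.0, 2.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0};
        System.out.println(fromPercentile(sse, 10.0)); // prints "1.0"
        System.out.println(fromPercentile(sse, 0.0));  // prints "0.0"
    }
}
```

Other percentile conventions (for example, linear interpolation between ranks) would give slightly different values; the text does not say which one ORIOGEN uses.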
The P-values are obtained by bootstrapping the residuals. The resulting bootstrap methodology
remains valid if repeated measurements are made on each subject over time or if,
within each gene, the data are heteroscedastic over time. For each gene, the
number of bootstrap samples is selected adaptively using the methodology
provided in Guo and Peddada (2008). This modification results in a substantial
reduction in computation time while maintaining the desired FDR level.
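The general idea of obtaining a P-value by bootstrapping residuals can be sketched as below. This is a deliberately simplified illustration under stated assumptions, not the ORIOGEN procedure: it pools residuals across groups (which presumes independent, homoscedastic data), uses a toy statistic (last group mean minus first group mean) in place of the order-restricted statistic, and draws a fixed number of bootstrap samples rather than choosing the number adaptively. All names (ResidualBootstrap, pValue, statistic) are hypothetical.

```java
import java.util.Random;

/** Illustrative only: residual bootstrap P-value for a toy trend statistic. */
final class ResidualBootstrap {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Toy statistic (an assumption): mean of the last group minus mean of the first group. */
    static double statistic(double[][] groups) {
        return mean(groups[groups.length - 1]) - mean(groups[0]);
    }

    static double pValue(double[][] groups, int numBootstrap, long seed) {
        // Residuals: each observation minus its own group mean, pooled over groups.
        int total = 0;
        for (double[] g : groups) {
            total += g.length;
        }
        double[] residuals = new double[total];
        int k = 0;
        for (double[] g : groups) {
            double m = mean(g);
            for (double v : g) {
                residuals[k++] = v - m;
            }
        }

        double observed = statistic(groups);
        Random rng = new Random(seed);
        int atLeastAsLarge = 0;
        for (int b = 0; b < numBootstrap; b++) {
            // Null data: every group is filled with residuals drawn with replacement.
            double[][] resampled = new double[groups.length][];
            for (int i = 0; i < groups.length; i++) {
                resampled[i] = new double[groups[i].length];
                for (int j = 0; j < resampled[i].length; j++) {
                    resampled[i][j] = residuals[rng.nextInt(total)];
                }
            }
            if (statistic(resampled) >= observed) {
                atLeastAsLarge++;
            }
        }
        // Add-one correction keeps the estimated P-value strictly positive.
        return (atLeastAsLarge + 1.0) / (numBootstrap + 1.0);
    }

    public static void main(String[] args) {
        // Hypothetical gene with a strong upward trend over three time points.
        double[][] gene = {{1.0, 1.1, 0.9}, {5.0, 5.1, 4.9}, {9.0, 9.1, 8.9}};
        System.out.println(pValue(gene, 1000, 42L));
    }
}
```

Handling repeated measures or heteroscedastic data, as the text describes, would require a different resampling scheme than the pooled one assumed here.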
Once a gene is declared significant by the above
process, it is assigned to the profile with the largest goodness-of-fit
statistic.
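The assignment rule above amounts to picking the profile with the maximum goodness-of-fit value. A minimal sketch follows, assuming the per-profile statistics are already available in a map; the class name, method name, profile labels and numbers are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: assigns a gene to the candidate profile with the largest fit statistic. */
final class ProfileAssignment {

    /** Returns the key with the largest value, or null if the map is empty. */
    static String bestProfile(Map<String, Double> fitByProfile) {
        String best = null;
        double bestFit = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : fitByProfile.entrySet()) {
            if (entry.getValue() > bestFit) {
                bestFit = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical goodness-of-fit statistics for one significant gene.
        Map<String, Double> fits = new LinkedHashMap<>();
        fits.put("increasing", 2.1);
        fits.put("umbrella", 3.4);
        fits.put("decreasing", 0.2);
        System.out.println(bestProfile(fits)); // prints "umbrella"
    }
}
```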
The results for significant genes are saved to a
text file and their profile fitted means as well as the raw sample means are
displayed graphically. Q-values for each selected gene are calculated and
stored in the output file (Storey, 2002). Gene ontology information for
significant genes is provided where available.
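As a rough illustration of how q-values relate to P-values, the sketch below computes q-values with the proportion of true null hypotheses fixed at 1. This is a conservative simplification (it coincides with Benjamini-Hochberg adjusted P-values) and is an assumption of the example; Storey (2002) estimates that proportion from the data, and the text does not describe how ORIOGEN implements the calculation. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: q-values with the null proportion pi0 assumed to be 1. */
final class QValues {

    /** Returns q-values in the same order as the input P-values. */
    static double[] qValues(double[] p) {
        int m = p.length;
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) {
            order[i] = i;
        }
        // Indices sorted by ascending P-value.
        Arrays.sort(order, Comparator.comparingDouble(i -> p[i]));

        double[] q = new double[m];
        double runningMin = 1.0;
        // Walk from the largest P-value down, enforcing monotone q-values.
        for (int rank = m; rank >= 1; rank--) {
            int idx = order[rank - 1];
            runningMin = Math.min(runningMin, p[idx] * m / rank);
            q[idx] = runningMin;
        }
        return q;
    }

    public static void main(String[] args) {
        // Hypothetical P-values for four genes.
        double[] p = {0.01, 0.04, 0.03, 0.5};
        System.out.println(Arrays.toString(qValues(p)));
    }
}
```

For the four P-values in main, the q-values are 0.04, about 0.0533, about 0.0533 and 0.5, in input order.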
Acknowledgements
We thank Drs. Leping Li, David M. Umbach and Clarice Weinberg, Biostatistics Branch, NIEHS, for numerous
discussions and their feedback during the preparation of an earlier version of this software.
References
Efron, B. and Tibshirani, R. (1993). An
Introduction to the Bootstrap. Chapman and Hall Inc, New York, NY.
Guo, W. and Peddada, S. D. (2008). Adaptive Choice of the Number of Bootstrap Samples
in Large Scale Multiple Testing. Statistical Applications in Genetics and
Molecular Biology, 7(1), Art. 13.
Guo, W., Sarkar, S. K., and Peddada, S. D. (2010). Controlling False Discoveries in Multidimensional
Directional Decisions, with Applications to Gene Expression Data on Ordered
Categories. Biometrics, 66, 485-492.
Hwang, J. and Peddada, S. (1994). Confidence
interval estimation subject to order restrictions. Annals of Statistics,
22, 67-93.
Liu, D., Umbach, D. M., Peddada, S. D., Li, L.,
Crockett, P., and Weinberg, C. (2004). A Random-Periods Model for Expression of
Cell-Cycle Genes. Proceedings of the National Academy of Sciences, 101(19), 7240-7245.
Peddada, S., Lobenhofer, E., Li, L., Afshari, C., Weinberg, C., and Umbach, D. M.
(2003). Gene selection and clustering for time-course and dose-response
microarray experiments using order restricted inference. Bioinformatics,
19(7), 834-841.
Peddada, S. D., Harris, S., Zajd, J., and Harvey, E. (2005). ORIOGEN: Order Restricted Inference for Ordered Gene
Expression data. Bioinformatics, 21, 3933-3934.
Peddada, S. D., Harris, S. F., and Davidov, O. (2010). Analysis of Correlated Gene Expression Data on
Ordered Categories. Journal of the Indian Society of Agricultural
Statistics, 64(1), 45-60. Epub 2010/01/01. PubMed PMID:
21998487; PubMed Central PMCID: PMC3190572.
Storey, J.D. (2002). A direct approach to false
discovery rates. J. R. Stat. Soc. B, 64, 479-498.