ORIOGEN (Order Restricted Inference for Ordered Gene ExpressioN)

and multidimensional pairwise comparisons

 

Developed by:

Shyamal D. Peddada
Biostatistics Branch
National Institute of Environmental Health Sciences
National Institutes of Health
peddada@niehs.nih.gov

Programmed by:

John Zajd and Shawn Harris
SRA International, Inc.
shawn_harris@sra.com

© Copyright 2004-2010

ORIOGEN Version 3.04.01 Release Description

ORIOGEN is a user-friendly Java-based software package for selecting and clustering genes according to their time-course or dose-response profiles.

It can be used for the following purposes:

 

1.      Comparison of high dimensional data (e.g. gene expression) among two or more ordered experimental conditions, such as in dose-response studies or time-course experiments. This software can be used when the data are independent among experimental conditions or when they are dependent, as in repeated measurement designs. The underlying methodology is described in Peddada et al. (2003, 2005, 2010). Since the methodology is based on bootstrapping the residuals, this software may not be suitable if the sample size per group is very small (e.g. 3), especially when the data between groups are correlated. For computational efficiency, this methodology uses the adaptive bootstrap described in Guo and Peddada (2008). This methodology attempts to control the false discovery rate (FDR).

 

2.      Pairwise comparisons of high dimensional data (e.g. gene expression) among two or more experimental conditions. Pairs to be compared are chosen a priori by the user. This software can be used when the data are independent among experimental conditions. The methodology not only controls the overall false discovery rate for making all desired pairwise comparisons, but also controls the error committed in the direction of inequality between groups for each differentially expressed variable (e.g. gene). Thus the methodology controls the mixed directional FDR (mdFDR). This software is based on the methodology described in Guo, Sarkar and Peddada (2010).

 

 

ORIOGEN is based on the methodology developed in Peddada et al. (2003) and refined in Guo and Peddada (2008) and Peddada et al. (2010). Version 3.0 of ORIOGEN has the following advantages over earlier versions:

The user pre-specifies a list of profiles (or patterns) of mean gene expression over time/dose that may be of interest for a specific experiment. The present version of ORIOGEN can detect increasing, decreasing, umbrella-shaped or inverted-umbrella-shaped patterns and cyclic patterns (up to one cycle only).

Here the word mean refers to the population mean (which is unknown) and not the sample mean, which is calculated from the given data and provides an estimate of the population mean. Thus the profiles are described in terms of the population means. Note that, since the sample mean is a random realization from a population, the observed sample mean expressions over time/dose may not conform exactly to the pattern satisfied by the population means. For example, the experimenter may be interested in selecting a gene whose mean expression increases with time/dose (known as increasing shape). However, due to the randomness in the data, the observed sample means may not necessarily have an increasing profile. Similarly, an experimenter may be interested in selecting genes where the mean expression increases with time/dose up to a certain point and then decreases (i.e. umbrella shape). Due to randomness in the data, the sample means may not necessarily follow this pattern.
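To make the distinction between sample means and an order-restricted profile concrete, the following is a minimal sketch that fits an increasing profile to sample means that are not themselves increasing. It uses the pool-adjacent-violators algorithm as a stand-in chosen for brevity; it is not ORIOGEN's own estimation procedure, and the function name fit_increasing and the example values are hypothetical.

```python
def fit_increasing(means, weights):
    """Weighted least-squares fit of means subject to m1 <= m2 <= ... <= mk.

    Implements the pool-adjacent-violators algorithm: whenever two
    neighbouring blocks violate the increasing order, they are merged
    into one block carrying their weighted average.
    """
    blocks = []  # each block is [weighted mean, total weight, number of points]
    for mean, weight in zip(means, weights):
        blocks.append([mean, weight, 1])
        # Merge backwards while the last two blocks are out of order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            pooled = (m1 * w1 + m2 * w2) / (w1 + w2)
            blocks.append([pooled, w1 + w2, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical sample means at four time points, with equal replication.
sample_means = [1.0, 3.0, 2.0, 4.0]
replicates = [1, 1, 1, 1]
print(fit_increasing(sample_means, replicates))  # [1.0, 2.5, 2.5, 4.0]
```

The second and third sample means violate the increasing order, so the fit pools them into a common value. A decreasing profile could be fitted the same way by negating the means before and after the call.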

ORIOGEN does not normalize the data, so it is recommended that the user pre-process the data by applying a suitable normalization method before submitting the data to ORIOGEN. ORIOGEN selects genes, based on a statistical decision rule with a pre-specified level of significance, and clusters each selected gene into an appropriate "best-fitting" pattern or profile. The methodology can be briefly described as follows.

ORIOGEN expresses each pre-specified profile in terms of mathematical inequalities (known as order restrictions) between the mean expressions. Then, using the methodology developed in Hwang and Peddada (1994), it fits each pre-specified profile to each gene. Thus, for a given gene, ORIOGEN computes a "goodness-of-fit" statistic for each candidate profile. For a gene g, it then tests for significance using a minor modification to the statistic [symbol image image002.gif not available] obtained in Step 3 of Peddada et al. (2003). This modification replaces Step 7 described in Peddada et al. (2003). As in the well-known SAM methodology, we include a "fudge factor" s0. This factor is calculated as the nth percentile of the SSE values for all of the genes, where n is input by the user. Typical settings of n are 10% for repeated measures data and 0% (disabled) for ordinary data. The modified test statistic is

[Equation image image004.gif not available: modified test statistic]

where, for gene g, sg is the pooled sample standard deviation for all time points/dose groups, and nj and nk are the numbers of replicates at the endpoints of the [symbol image image002.gif not available] region.
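The percentile rule for the fudge factor s0 can be sketched as follows. This is an illustration only: the function names are hypothetical, the linear-interpolation percentile convention is an assumption (the text does not say which convention is used), and treating a setting of 0% as s0 = 0 is an assumed reading of "disabled".

```python
def percentile(values, pct):
    """pct-th percentile of values, interpolating linearly between order statistics.

    The interpolation convention is an assumption made for this sketch.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = (pct / 100.0) * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def fudge_factor(sse_by_gene, pct):
    """s0 as the pct-th percentile of the per-gene SSE values.

    A setting of 0 is treated as "disabled" and returns 0.0 (assumed reading).
    """
    if pct == 0:
        return 0.0
    return percentile(sse_by_gene, pct)


# Hypothetical SSE values, one per gene.
sse_values = [4.0, 1.0, 3.0, 2.0, 5.0]
print(fudge_factor(sse_values, 50))  # median of the SSE values: 3.0
print(fudge_factor(sse_values, 0))   # disabled: 0.0
```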

The P-values are obtained by bootstrapping the residuals. The resulting bootstrap methodology is valid if repeated measurements are made on each subject over time or if, within each gene, the data are heteroscedastic over time. For each gene, the number of bootstrap samples is selected adaptively using the methodology provided in Guo and Peddada (2008). This modification results in a substantial reduction in computation time while maintaining the desired FDR level.
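The general idea of a residual bootstrap P-value for a single gene can be sketched as below. Several simplifications are assumed and none of them should be read as ORIOGEN's actual procedure: the statistic is simply the range of the fitted increasing profile (a hypothetical stand-in), residuals are resampled within each time point, the number of bootstrap samples is fixed rather than chosen adaptively, and all function names are hypothetical.

```python
import random


def fit_increasing(means, weights):
    """Pool-adjacent-violators fit subject to an increasing order."""
    blocks = []  # [weighted mean, total weight, number of points]
    for mean, weight in zip(means, weights):
        blocks.append([mean, weight, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def profile_range(groups):
    """Stand-in statistic: last minus first fitted mean under the increasing profile."""
    means = [sum(g) / len(g) for g in groups]
    sizes = [len(g) for g in groups]
    fitted = fit_increasing(means, sizes)
    return fitted[-1] - fitted[0]


def bootstrap_p_value(groups, n_boot=500, seed=0):
    """Residual bootstrap P-value for the null hypothesis of equal means."""
    rng = random.Random(seed)
    observed = profile_range(groups)
    # Residuals: each observation minus its own group's sample mean.
    residuals = [[x - sum(g) / len(g) for x in g] for g in groups]
    exceed = 0
    for _ in range(n_boot):
        # Resample residuals with replacement within each time point.
        resampled = [[rng.choice(r) for _ in r] for r in residuals]
        if profile_range(resampled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_boot + 1)


# Hypothetical expression values for one gene at three doses.
gene = [[1.0, 1.1, 0.9, 1.05], [2.0, 2.1, 1.9, 2.05], [3.0, 3.1, 2.9, 3.05]]
print(bootstrap_p_value(gene, n_boot=200))
```

Resampling within each time point is one simple way to respect unequal variances over time; for repeated measures, whole subject-level residual vectors would need to be resampled together, which this sketch does not do.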

Once a gene is declared significant by the above process, it is assigned to the profile with the largest goodness-of-fit statistic.

The results for significant genes are saved to a text file and their profile fitted means as well as the raw sample means are displayed graphically. Q-values for each selected gene are calculated and stored in the output file (Storey, 2002). Gene ontology information for significant genes is provided where available.
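For readers unfamiliar with q-values, the sketch below shows one common way to compute them from a list of P-values in the spirit of Storey (2002). It is an illustration under stated assumptions, not a description of ORIOGEN's implementation: the tuning parameter lam = 0.5 used to estimate the proportion of true nulls is an assumed default, and the function name q_values is hypothetical.

```python
def q_values(p_values, lam=0.5):
    """Q-values from P-values using a fixed-lambda estimate of the null proportion.

    pi0 is estimated as the share of P-values above lam, rescaled by (1 - lam)
    and capped at 1. If no P-value exceeds lam, pi0 is 0 and every q-value is 0.
    """
    m = len(p_values)
    pi0 = min(1.0, sum(1 for p in p_values if p > lam) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        candidate = pi0 * m * p_values[index] / rank
        running_min = min(running_min, candidate)
        q[index] = running_min
    return q


# Hypothetical P-values for five genes.
print(q_values([0.01, 0.02, 0.03, 0.6, 0.9]))
```

With these inputs the estimated null proportion is 0.8, and the three smallest P-values share a q-value of about 0.04.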

Remarks:

Acknowledgements

We thank Drs. Leping Li, David M. Umbach and Clarice Weinberg, Biostatistics Branch, NIEHS, for numerous discussions and their feedback during the preparation of an earlier version of this software.

References

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall Inc, New York, NY.

Guo, W., and Peddada, SD. (2008). Adaptive Choice of the Number of Bootstrap Samples in Large Scale Multiple Testing. Statistical Applications in Genetics and Molecular Biology, 7 (1), Art. 13.

Guo, W., Sarkar, SK., and Peddada, SD. (2010). Controlling False Discoveries in Multidimensional Directional Decisions, with Applications to Gene Expression Data on Ordered Categories. Biometrics, 66, 485-492.

Hwang, J. and Peddada, S. (1994). Confidence interval estimation subject to order restrictions. Annals of Statistics, 22, 67-93.

Liu, D., Umbach, D. M., Peddada, S. D., Li, L., Crockett, P., and Weinberg, C. (2004). A Random-Periods Model for Expression of Cell-Cycle Genes. Proceedings of the National Academy of Sciences, 101, No. 19, 7240-7245.

Peddada, S., Lobenhofer, E., Li, L., Afshari, C., Weinberg, C., and Umbach, D. M. (2003). Gene selection and clustering for time-course and dose-response microarray experiments using order restricted inference. Bioinformatics, 19(7), 834-841.

Peddada SD, Harris S, Zajd J and Harvey E. (2005). ORIOGEN: Order Restricted Inference for Ordered Gene Expression data. Bioinformatics, 21, 3933-3934.

 

Peddada SD, Harris SF, Davidov O (2010). Analysis of Correlated Gene Expression Data on Ordered Categories. Journal of the Indian Society of Agricultural Statistics, 64(1), 45-60. PubMed PMID: 21998487; PubMed Central PMCID: PMC3190572.

Storey, J.D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. B, 64, 479-498.

