Leping Li has been involved in the following research projects within the Biostatistics Branch.
Optimized Mixture Markov Models for Motif Identification
Identifying functional elements, such as transcription factor binding sites, is a fundamental step in reconstructing gene regulatory networks and remains challenging, largely because few training samples are available. Li and his staff introduced a novel and flexible model, the Optimized Mixture Markov model (OMiMa), and related methods that allow model complexity to be adjusted for different motifs. In comparison with other leading methods, OMiMa incorporates more than NNSplice's pairwise dependencies, avoids model over-fitting better than the Permuted Variable Length Markov Model (PVLMM) and requires smaller training samples than the Maximum Entropy Model (MEM). Tests on both simulated and actual data (regulatory cis-elements and splice sites) suggested that OMiMa's performance is superior to the other leading methods in terms of prediction accuracy, required size of training data and computational time. The group’s optimized mixture of Markov models represents an alternative to the existing methods for modeling dependent structures within a biological motif.
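To illustrate what modeling dependencies within a motif means, the following minimal sketch trains a position-specific first-order Markov model on aligned sites and scores a candidate sequence by its log-likelihood. It is not OMiMa itself, which mixes Markov models and optimizes their complexity; the function names, the pseudocount and the toy sites are assumptions made for illustration only.

```python
import math

ALPHABET = "ACGT"

def train_first_order(sites, pseudocount=1.0):
    """Estimate a position-specific first-order Markov model from aligned sites.

    Hypothetical helper for illustration; all sites must have equal length.
    """
    length = len(sites[0])
    # Probability of each base at the first motif position.
    start_counts = {b: pseudocount for b in ALPHABET}
    for site in sites:
        start_counts[site[0]] += 1
    total = sum(start_counts.values())
    start = {b: c / total for b, c in start_counts.items()}
    # trans[i-1][a][b] = P(base b at position i | base a at position i-1).
    trans = []
    for i in range(1, length):
        counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
        for site in sites:
            counts[site[i - 1]][site[i]] += 1
        trans.append({
            a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
            for a in ALPHABET
        })
    return start, trans

def log_likelihood(seq, model):
    """Log-probability of seq under the trained model."""
    start, trans = model
    ll = math.log(start[seq[0]])
    for i in range(1, len(seq)):
        ll += math.log(trans[i - 1][seq[i - 1]][seq[i]])
    return ll

# Toy aligned sites (made up for this example).
toy_sites = ["ACGT", "ACGT", "AGCT", "ACGT"]
toy_model = train_first_order(toy_sites)
```

A sequence resembling the training sites, such as "ACGT", receives a higher log-likelihood under toy_model than an unrelated sequence such as "TTTT". Unlike an independent-position model, the score depends on adjacent-base pairs, which is the kind of dependency that higher-order and mixture models generalize.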
Accurate Anchoring Alignment of Divergent Sequences
Obtaining high-quality alignments of divergent homologous sequences for cross-species sequence comparison remains a challenge. Li’s staff has proposed a novel pairwise sequence alignment algorithm, ACANA (ACcurate ANchoring Alignment), for aligning biological sequences at both local and global levels. Like many fast heuristic methods, ACANA uses an anchoring strategy. However, unlike others, ACANA uses a Smith-Waterman-like dynamic programming algorithm to recursively identify near-optimal regions as anchors for a global alignment. Performance evaluations using a simulated benchmark dataset and real promoter sequences suggest that ACANA is accurate and consistent, especially for divergent sequences. Specifically, using a simulated benchmark dataset, Li’s group has shown that ACANA had the highest sensitivity in aligning constrained functional sites compared to BLASTZ, CHAOS and DIALIGN for local alignment and compared to AVID, ClustalW, DIALIGN and LAGAN for global alignment. Applied to 6007 pairs of human-mouse orthologous promoter sequences, ACANA identified the largest number of conserved regions (defined as greater than 70% identity over 100 bp) compared to AVID, ClustalW, DIALIGN and LAGAN. In addition, the average length of a conserved region identified by ACANA was the longest. Thus, the group suggests that ACANA is a useful tool for identifying functional elements in cross-species sequence analysis, such as predicting transcription factor binding sites in non-coding DNA.
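The anchoring idea can be sketched as follows: find the best local alignment with Smith-Waterman dynamic programming, keep it as an anchor, then repeat on the flanking regions to its left and right. This is a simplified illustration under assumed scoring parameters (match, mismatch, linear gap penalty, minimum anchor score), not ACANA's actual implementation; the function names are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, a_start, a_end, b_start, b_end) of the best local alignment.

    Coordinates are half-open, so a[a_start:a_end] aligns to b[b_start:b_end].
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, best_pos[0], j, best_pos[1]

def find_anchors(a, b, min_score=6, a_off=0, b_off=0):
    """Recursively collect ordered, non-overlapping anchors (illustrative only)."""
    if not a or not b:
        return []
    score, a0, a1, b0, b1 = smith_waterman(a, b)
    if score <= 0 or score < min_score:
        return []
    left = find_anchors(a[:a0], b[:b0], min_score, a_off, b_off)
    right = find_anchors(a[a1:], b[b1:], min_score, a_off + a1, b_off + b1)
    anchor = (a_off + a0, a_off + a1, b_off + b0, b_off + b1)
    return left + [anchor] + right

# Two conserved blocks separated by a divergent stretch (toy sequences).
seq_a = "ACGTACGT" + "T" * 20 + "GGCCGGCC"
seq_b = "ACGTACGT" + "A" * 20 + "GGCCGGCC"
anchors = find_anchors(seq_a, seq_b)
```

Because each anchor restricts the search to the regions on either side of it, the anchors come out in order and can serve as fixed points for a subsequent global alignment of the stretches between them.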
A Method for Gene Set Enrichment Analysis for Continuous Non-Monotone Relationships
Gene set enrichment analysis (GSEA) uses a local statistic to assess the association between the expression level of a gene and the value of a phenotypic endpoint. Commonly used local statistics include t statistics for binary phenotypes and correlation coefficients that assume a linear or monotone relationship between a continuous phenotype and gene expression level. Methods applicable to continuous non-monotone relationships are needed. Li’s group proposes to use, as the local statistic, the square of the multiple correlation coefficient, R², from fitting natural cubic spline models to the phenotype-expression relationship. The group incorporates this association measure into the GSEA framework to identify significant gene sets. Furthermore, the group describes a procedure for inference across multiple GSEA analyses.
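As a rough sketch of such a local statistic, the code below fits a natural cubic spline by ordinary least squares and returns R². It uses the standard truncated-power basis for natural cubic splines; the knot positions, the choice of which variable is the predictor, and all function names are assumptions for illustration, and the group's own fitting procedure may differ.

```python
def natural_spline_basis(x, knots):
    """Natural cubic spline basis at x (truncated-power form, linear beyond the end knots)."""
    K = len(knots)

    def d(k):
        num = max(x - knots[k], 0.0) ** 3 - max(x - knots[K - 1], 0.0) ** 3
        return num / (knots[K - 1] - knots[k])

    return [1.0, x] + [d(k) - d(K - 2) for k in range(K - 2)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def spline_r_squared(x_vals, y_vals, knots):
    """R-squared of a least-squares natural cubic spline fit of y on x."""
    X = [natural_spline_basis(v, knots) for v in x_vals]
    p = len(X[0])
    # Normal equations: (X^T X) beta = X^T y.
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * y for row, y in zip(X, y_vals)) for i in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean_y = sum(y_vals) / len(y_vals)
    sse = sum((y - f) ** 2 for y, f in zip(y_vals, fitted))
    sst = sum((y - mean_y) ** 2 for y in y_vals)
    return 1.0 - sse / sst

# A U-shaped (non-monotone) toy relationship.
xs = [i * 0.25 - 2.0 for i in range(17)]
ys = [v * v for v in xs]
r2 = spline_r_squared(xs, ys, knots=[-2.0, 0.0, 2.0])
```

For this symmetric U-shaped example the linear correlation between xs and ys is zero, so a correlation-based statistic would miss the association, while the spline-based R² is close to 1.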
Optimizing Position Weight Matrices (PWMs) for Motif Detection
PWMs have been widely used to scan promoter sequences for putative transcription factor binding sites. PWMs for many transcription factors are currently available in the Transfac database. However, PWMs are usually created from a few known, but not always validated, transcription factor binding sites, resulting in a poor estimate of the PWM. Thus, the results obtained with these PWMs may not be reliable. While identifying and validating a binding site is time-consuming, chromatin immunoprecipitation (ChIP) with microarray (ChIP-on-chip) provides an alternative way to identify many low-resolution binding sites, that is, regions rather than specific binding sites. Recently, this technology has been successfully applied to Oct4, Sox2, Nanog and p53. Because ChIP cannot pinpoint the exact location of a binding site, such data are rarely used in constructing a PWM. Li’s group has developed a method that allows a statistical model to be built from the ChIP data.
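For readers unfamiliar with PWM scanning, here is a minimal sketch of the conventional approach described at the start of this section: build a log-odds matrix from a handful of known sites and slide it along a sequence. It does not show the group's ChIP-based method. The pseudocount, the uniform background frequency, the function names and the example sites are all assumptions.

```python
import math

def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length sites."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pwm

def scan(seq, pwm):
    """Score every window of seq; returns a list of (position, score)."""
    w = len(pwm)
    return [
        (pos, sum(pwm[i][seq[pos + i]] for i in range(w)))
        for pos in range(len(seq) - w + 1)
    ]

# Toy sites and a toy promoter fragment (made up for this example).
known_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(known_sites)
hits = scan("CCCCTGACTCAGGGG", pwm)
best_pos, best_score = max(hits, key=lambda h: h[1])
```

With only four training sites, each matrix entry rests on very few observations, which illustrates why PWMs estimated from a small number of sites can be unreliable.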
A Genetic Algorithm/k-nearest Neighbor (GA/KNN) Method for Microarray and Proteomics Data Analysis
Li’s group describes a method for assessing the importance of genes for sample classification based on expression data. The approach combines a genetic algorithm (GA) and the k-nearest neighbor (KNN) method to identify genes that can jointly discriminate between two types of samples, such as normal versus tumor. First, many such subsets of differentially expressed genes are obtained independently using the GA. Then, the overall frequency with which genes were selected is used to deduce the relative importance of genes for sample classification. Sample heterogeneity is accommodated; that is, the method should be robust to the existence of distinct subtypes. Li applied GA/KNN to expression data from normal versus tumor tissue from human colon. Two distinct clusters were observed when the 50 most frequently selected genes were used to classify all of the samples in the data sets studied, and the majority of samples were classified correctly. Identification of a set of differentially expressed genes could aid in tumor diagnosis and could also serve to identify disease subtypes that may benefit from distinct clinical approaches to treatment.
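The two-stage procedure can be sketched in a few lines: a genetic algorithm searches for small gene subsets that classify well under leave-one-out KNN, and the subsets found in repeated independent runs are tallied. This is a heavily simplified illustration, not the published GA/KNN implementation: it uses mutation only with truncation selection, and the population size, generation count, k, function names and toy data are all assumptions.

```python
import random
from collections import Counter

def loo_knn_accuracy(data, labels, genes, k=3):
    """Leave-one-out accuracy of k-nearest-neighbor voting using only `genes`."""
    correct = 0
    for i, sample in enumerate(data):
        dists = sorted(
            (sum((sample[g] - other[g]) ** 2 for g in genes), labels[j])
            for j, other in enumerate(data) if j != i
        )
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(data)

def ga_select(data, labels, subset_size, pop_size=20, generations=30, rng=None):
    """Evolve one gene subset; requires more genes than subset_size."""
    rng = rng or random.Random()
    n_genes = len(data[0])
    fitness = lambda subset: loo_knn_accuracy(data, labels, subset)
    pop = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        survivors = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            # Mutation: replace one gene with a gene not already in the subset.
            new_gene = rng.choice([g for g in range(n_genes) if g not in child])
            child[rng.randrange(subset_size)] = new_gene
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

def gene_frequencies(data, labels, subset_size, runs=10, seed=0):
    """Tally how often each gene appears in the subsets from independent runs."""
    rng = random.Random(seed)
    freq = Counter()
    for _ in range(runs):
        freq.update(ga_select(data, labels, subset_size, rng=rng))
    return freq

# Toy data: 6 samples x 5 genes; gene 0 separates the two classes.
toy_data = [
    [0.0, 1.0, 3.0, 2.0, 5.0],
    [0.1, 4.0, 1.0, 5.0, 2.0],
    [0.2, 2.0, 5.0, 1.0, 4.0],
    [5.0, 1.5, 2.5, 4.5, 3.0],
    [5.1, 3.5, 4.0, 2.5, 1.0],
    [5.2, 2.5, 1.5, 3.5, 4.5],
]
toy_labels = [0, 0, 0, 1, 1, 1]
freq = gene_frequencies(toy_data, toy_labels, subset_size=2, runs=3, seed=1)
```

Ranking genes by freq.most_common() then gives a relative importance ordering. Because each run starts from a different random population, genes that recur across runs are those that contribute to good classification in many different subsets.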