Leping Li, Ph.D.

Leping Li, Ph.D. has been involved in the following research projects within the Biostatistics and Computational Biology Branch.

Methods and Applications for Motif Analysis

We have a longstanding interest in methods for identifying transcription factor binding sites in DNA sequences. This area offers an interesting and challenging set of methodological problems as well as the opportunity to work closely with laboratory scientists who are developing an understanding of the mechanisms of transcriptional regulation. Transcription factor binding sites are short functional elements (5-30 base pairs) in the genome which, when bound by proteins, induce or repress gene transcription. Transcription controls temporal and spatial gene expression; thus, it is vital to many cellular processes.

We have been working on methods to improve the accuracy of transcription factor binding site identification. Specifically, we developed an efficient algorithm for identifying conserved segments between two promoter sequences. Like many others, we reasoned that functional elements are more likely to be found in conserved regions than in non-conserved regions. Similarly, we worked on methods that improve the quality of statistical models for binding site identification. We developed a publicly available tool for de novo discovery of transcription factor binding sites in a set of DNA sequences. The tool relies on statistical over-representation of the binding sites and requires no prior knowledge of where the binding sites are located or what they look like.

We also developed a statistical method for identifying the motifs of transcription co-regulators in ChIP-seq data. Multiple transcription factors are known to work together to regulate gene expression in development and specification, yet most existing methods for motif discovery consider only one motif at a time. We have developed a multi-component mixture framework to model the joint distribution of two motifs. We classify a sequence as containing motif 1 alone, motif 2 alone, both motifs 1 and 2, or pure statistical “noise”. We have also developed a method for identifying enriched motifs in the promoters of a set of genes.
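
The four-way classification in the mixture framework described above can be illustrated with a generic mixture-model posterior. The sketch below is not the published method: the component labels, the example log-likelihoods and the mixing weights are all hypothetical, and a real analysis would compute the likelihoods from motif and background models.

```python
from math import exp, log

# Hypothetical labels for the four mixture components.
COMPONENTS = ("motif1_only", "motif2_only", "both", "noise")


def posterior(log_liks, weights):
    """Posterior probability of each component for one sequence.

    log_liks: log-likelihood of the sequence under each component.
    weights:  prior mixing proportions, one per component, summing to 1.
    """
    scores = [ll + log(w) for ll, w in zip(log_liks, weights)]
    top = max(scores)  # subtract the maximum for numerical stability
    unnormalized = [exp(s - top) for s in scores]
    total = sum(unnormalized)
    return {c: u / total for c, u in zip(COMPONENTS, unnormalized)}


def classify(log_liks, weights):
    """Assign the sequence to the component with the highest posterior."""
    post = posterior(log_liks, weights)
    return max(post, key=post.get)


# Made-up numbers for a single sequence (assumptions, not real data).
example_log_liks = [-42.0, -50.0, -40.5, -47.0]
example_weights = [0.2, 0.2, 0.1, 0.5]
example_posterior = posterior(example_log_liks, example_weights)
example_label = classify(example_log_liks, example_weights)
```

Working in log space keeps the calculation stable when likelihoods are very small, which is typical for sequences of realistic length.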

T-KDE: A method for genome-wide identification of constitutive protein binding sites from multiple ChIP-seq data sets

However, not all proteins bind to DNA directly. For example, some proteins may bind to another protein that is already bound to DNA. For proteins that bind indirectly, distinct sequence signatures (motifs) on DNA may not exist, so alternative robust methods for identifying constitutive protein binding sites were needed. We proposed an effective and efficient alternative to binning for locating the binding sites of proteins that may bind directly or indirectly. It uses peak centers from ChIP-seq (chromatin immunoprecipitation followed by sequencing) as input data. Our algorithm, T-KDE, identifies binding site locations by combining a kernel density estimator (KDE) with a binary range tree. Kernel density estimation is an unsupervised, non-parametric technique for estimating a continuous probability density function from sample data. For our purpose, we used a KDE to find the genomic regions that contain the highest density of ChIP-seq peak centers from multiple cell lines or cell types for a given transcription factor (TF). Using a binary range tree in conjunction with kernel density estimation enhances T-KDE’s speed. A binary range tree is a data structure that is useful in many applications involving range or nearest-neighbor searches, indexing and clustering; we used it to recursively subdivide the set of peak centers into subgroups that allow efficient density estimation and mode finding.
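
The density-estimation step can be sketched as follows. This is a minimal illustration rather than T-KDE itself: it uses a Gaussian kernel, evaluates the density by brute force on an integer grid, and omits the binary range tree that gives T-KDE its speed. The function names, the bandwidth and the peak-center coordinates are assumptions made for the example.

```python
from math import exp, pi, sqrt


def gaussian_kde(points, bandwidth):
    """Return a function that evaluates a 1-D Gaussian KDE at position x."""
    norm = 1.0 / (len(points) * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


def find_modes(points, bandwidth, step=1):
    """Grid positions where the estimated density has a strict local maximum.

    points must be integer coordinates. Ties on a perfectly flat top are
    ignored in this sketch.
    """
    density = gaussian_kde(points, bandwidth)
    grid = list(range(min(points), max(points) + 1, step))
    values = [density(x) for x in grid]
    modes = []
    for i, value in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i < len(values) - 1 else float("-inf")
        if value > left and value > right:
            modes.append(grid[i])
    return modes


# Hypothetical peak centers (bp) pooled from several cell lines.
peak_centers = [998, 1000, 1002, 1004, 4997, 5000, 5003]
modes = find_modes(peak_centers, bandwidth=10.0)
```

Each reported mode is a candidate binding-site location supported by peak centers from several datasets. The brute-force evaluation costs time proportional to the grid size times the number of peak centers, which is the cost that subdividing the peak centers into subgroups avoids.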

Using the locations of peak centers from 132 CTCF ChIP-seq datasets from the ENCODE project (additional datasets became available after our previous work on CTCF), we compared T-KDE to both the motif-based approach and the binning approach. T-KDE outperformed the binning approach and was competitive with the motif-based approach. More than 90% of the T-KDE-declared constitutive CTCF binding sites were within 20 base pairs (bp) of the nearest motif-declared constitutive CTCF binding site (16-bp canonical motif), indicating that T-KDE is highly accurate. T-KDE also identified constitutive CTCF binding sites that the motif-based approach failed to find because the ChIP-seq peaks lacked apparent motif sites. We also applied T-KDE to 21 other proteins for which replicate ChIP-seq datasets were available in six or more cell lines and found that the number of constitutive binding sites per protein varied from fewer than a hundred to tens of thousands.

Development of Methods for Analyzing Next-gen Sequencing Data

IUTA: a tool for effectively detecting differential isoform usage from RNA-seq data

We were interested in the problem of differential isoform usage. “Differential” means differences between two groups of samples and “isoform usage” denotes the set of relative abundances (proportions of total gene expression) of all isoforms of a gene. Our method recognizes that detecting differential isoform usage for a given gene between two groups of samples is a different problem from detecting differential expression of any particular individual isoform. The former problem, by focusing on the entire compositional vector, implicitly incorporates the constraint that the fractional contributions of the component isoforms sum to one.

We refer to our algorithm as IUTA (Isoform Usage Two-step Analysis). For a given gene, IUTA first estimates isoform usage in each sample based on paired-end RNA-seq alignment data using a statistical model similar to the one used by MISO. IUTA then tests for differential isoform usage between the two groups using those estimates. Because isoform usage is a type of compositional data, i.e., vectors that contain the proportions of each component comprising the whole unit, IUTA defines the statistical testing problem as a test of equal means for two multivariate distributions under the Aitchison geometry instead of under the usual Euclidean geometry. The Aitchison geometry is widely regarded as more suitable than Euclidean geometry for compositional data analysis. The IUTA package is now available online.
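
To make the difference between the two geometries concrete, the sketch below computes the Aitchison distance between two isoform-usage vectors through the centered log-ratio (clr) transform. It illustrates the geometry only and is not the IUTA test; the usage vectors are made up, and the code assumes every proportion is strictly positive (zeros would need separate handling).

```python
from math import log, sqrt


def clr(composition):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of x and y."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


def euclidean_distance(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical usage of three isoforms of one gene in two samples.
usage_a = [0.70, 0.20, 0.10]
usage_b = [0.35, 0.40, 0.25]

d_aitchison = aitchison_distance(usage_a, usage_b)
d_euclidean = euclidean_distance(usage_a, usage_b)
```

Because the clr transform works on log-ratios, the Aitchison distance is unchanged when a vector is rescaled (for example, from proportions to raw counts), whereas the Euclidean distance is not.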

ART, a tool set for next-generation sequencing simulation

ART was originally developed in 2012 and has been upgraded and maintained frequently ever since. ART is a simulation toolset we developed to generate a variety of sequencing data for three sequencing platforms: Illumina, Roche 454 and SOLiD. ART has been broadly used to test and benchmark a variety of methods and tools for next-gen sequencing data analysis, including read alignment, de novo assembly, and the discovery of single nucleotide polymorphisms (SNPs) and structural variation. The ART package is freely available online.

PAVIS: an annotation and visualization tool for ChIP-seq data for biologists

For a genome-wide study, the number of significant peaks can be tens to hundreds of thousands. The biological relevance of a ChIP-seq peak and the functions of its underlying DNA elements often depend on its position relative to nearby genes or other functional elements. It can be a challenging and time-consuming task to examine all peaks and to develop meaningful biological interpretations of their functional relevance. Motivated by this challenge, we developed a web-based tool, Peak Annotation and Visualization (PAVIS), to facilitate data comparison, interpretation and hypothesis generation from ChIP-seq peak data. PAVIS was designed with non-bioinformaticians in mind and presents a straightforward user interface to facilitate biological interpretation of ChIP-seq peak or other genomic enrichment data. Unlike many other resources, PAVIS provides a peak-oriented annotation and visualization system that allows simultaneous dynamic visualization of tens to hundreds of loci from one or more ChIP-seq experiments.

Methods and Applications for Mining High-dimensional Genomic Data

Our lab has a longstanding interest in methods and applications for mining high-dimensional genomic data using a variety of classification and data mining methods that are either publicly available or that we developed ourselves, such as genetic algorithm/k-nearest neighbor (GA/KNN), topic modeling, gradient boosting machines, random forests, support vector machines, kernel density estimation, variable length Markov models, hidden Markov models and many others.

Recently, we undertook a comprehensive pan-cancer classification of 9,096 tumor samples from 31 tumor types in The Cancer Genome Atlas (TCGA), applying the GA/KNN algorithm to RNA-seq gene expression data. We aimed to identify a set of genes whose expression levels can classify all 31 TCGA pan-cancer tumor types. Moreover, we sought to identify, separately in men and in women, analogous sets of genes that can distinguish the 23 sex non-specific tumor types. We hope to gain insight into sexual dimorphism in some tumors from those analyses.
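
As a rough illustration of how a candidate gene set can be scored for classification, the sketch below computes the leave-one-out accuracy of a k-nearest-neighbor classifier restricted to a chosen subset of genes. A genetic algorithm could use a score of this kind as its fitness function, but the code is a simplified assumption and not the GA/KNN implementation: it uses majority voting and Euclidean distance, and the expression values and labels are invented.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training samples closest to query."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(samples, labels, gene_subset, k=3):
    """Leave-one-out KNN accuracy using only the genes in gene_subset."""
    reduced = [[sample[g] for g in gene_subset] for sample in samples]
    correct = 0
    for i, query in enumerate(reduced):
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, query, k) == labels[i]:
            correct += 1
    return correct / len(reduced)


# Toy data: rows are samples, columns are genes (all values are made up).
# Genes 0 and 1 separate the two tumor types; gene 2 does not.
toy_samples = [
    [1.0, 5.0, 3.0],
    [1.2, 5.1, 9.0],
    [0.9, 4.8, 6.0],
    [5.0, 1.0, 3.1],
    [5.2, 1.1, 9.1],
    [4.9, 0.9, 6.1],
]
toy_labels = ["A", "A", "A", "B", "B", "B"]

informative_score = loo_accuracy(toy_samples, toy_labels, gene_subset=[0, 1])
uninformative_score = loo_accuracy(toy_samples, toy_labels, gene_subset=[2])
```

A search procedure would favor the subset with the higher score, here the one containing genes 0 and 1.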

We are interested in understanding the roles of the tumor microenvironment in tumor progression and metastasis. Each tumor is composed of multiple cell types; moreover, the proportional representation of cell types in a tumor differs across patients. We have been developing a computational deconvolution method that extracts cell-type-specific expression profiles, as well as the proportional representation of each cell type in a tumor, directly from expression levels measured on aggregate tumor samples. Our hierarchical Bayesian approach models multiple cell types as latent structures through a multinomial random variable of fixed dimension. We assume that the measured expression of a gene in a sample is a weighted average of the gene’s expression in each component cell type, with weights based on the relative proportions of the cell types. We use Dirichlet random variables to represent both cell-type-specific relative proportions and cell-type-specific expression profiles across genes. Further, we are developing a novel Metropolis-Hastings sampler to estimate the required posterior distributions. Cell types are identified by comparing the cell-type-specific expression profiles estimated by our procedure with an existing list of reference profiles or with cell-type-specific marker genes.
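
The weighted-average assumption stated above can be written as a small forward model. The sketch below covers only that assumption, not the hierarchical Bayesian estimation or the Metropolis-Hastings sampler; the function name, the three cell types and all numbers are hypothetical.

```python
def mix_expression(proportions, profiles):
    """Expected aggregate expression per gene for one tumor sample.

    proportions: relative proportion of each cell type (sums to 1).
    profiles:    one expression profile per cell type, each a list with
                 one value per gene.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("cell-type proportions must sum to 1")
    if len(proportions) != len(profiles):
        raise ValueError("need one profile per cell type")
    n_genes = len(profiles[0])
    return [
        sum(weight * profile[gene] for weight, profile in zip(proportions, profiles))
        for gene in range(n_genes)
    ]


# Hypothetical sample with three cell types and three genes.
# Each profile sums to 1 across genes, as a Dirichlet-distributed
# profile would.
cell_proportions = [0.6, 0.3, 0.1]
cell_profiles = [
    [0.5, 0.3, 0.2],  # cell type 1
    [0.1, 0.6, 0.3],  # cell type 2
    [0.2, 0.2, 0.6],  # cell type 3
]
bulk_expression = mix_expression(cell_proportions, cell_profiles)
```

Deconvolution runs this model in reverse: given only aggregate measurements across many samples, it estimates both the proportions and the profiles.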

Relevance to NIEHS Mission

Environmental exposures are major contributors to the diseases that we have studied, namely cancers. Melanoma, for example, is caused in part by exposure to UV radiation from the sun, and the etiology of lung cancer involves the joint effects of inherited genetic variants and environmental exposures, such as radiation, cigarette smoke and other forms of air pollution. Gene expression patterns in tumors have the potential to provide molecular signatures of specific exposures. The tools and methods that we develop can thus contribute to advancing environmental health science research.