Overview

The advent of next generation sequencing has revolutionized the manner in which DNA is sequenced and genomic events are monitored. RNA-Seq has the potential to capture the expression of every gene and its isoform(s) genome-wide. Modeling alternative tandem 3’UTRs in a dynamic fashion is an important problem in post-transcriptional regulation of mRNA and the progression of disease processes. This motivated the development of an RNA-Seq analysis method that specifically targets the 3’ UTR and dynamically models gene expression termination at polyadenylation sites. Here, RNA-Seq data and a dynamic approach are used to identify shortening of 3’UTRs. The approach uses a Poisson hidden Markov model (PHMM) to 1) estimate (hidden) states of gene expression levels in terminal exon 3’ UTRs, 2) infer shortening of the region and 3) demonstrate alternative polyadenylation.

Workflow for PHMM 3’UTR shortening

Workflow diagram for PHMM 3’UTR shortening
 

PHMM Method

First collect all the terminal exons located within the 3’ UTR region of the gene model transcripts. A sliding window of k base-pairs (bp) is applied to each terminal exon, where the number of reads mapped to each sliding window was recorded and where:

k=100 if 1<2kb; k=200 if l<4kb; k=400 if l<8kb; k=800 if l>8kb

The Poisson-based hidden Markov model (PHMM) is used to capture the sequence of read counts. In a PHMM one considers a sequence of discrete observations

Formula for sequence of discrete observations in a PHMM , which are assumed to be generated from a sequence of unobservable finite state Markov chains Formula of sequence of unobservable finite state Markov chains in PHMM with a finite state space Formula for finite state space in PHMM , and the random variable Yt conditioned on Xt has a Poisson distribution for every t. Specifically, if Example of Poisson distribution for PHMM , the emission probabilities are given by a Poisson distribution with parameter Poisson distribution with parameter in PHMM , i.e.,
formula for emission probabilities with poisson distribution with a parameter

Next let Yi,j be the transition probability from state i at time t – 1 to state j at time t, and assume

Formula for defining transition probabilites in PHMM

for any

Defined Transition Probability in PHMM . By defining the emission and transition probabilities, along with the initial state probabilities, one can estimate all the unknown parameters for a given observed sequence Observed Sequence in PHMM, using the maximum likelihood approach. In the RNA-Seq case, Observed Sequence in PHMM is a sequence of read counts with Example of RNA-Seq in Phmm where t=1,2,...,n where N is the number of k-bp sliding windows in a 3' terminal exon in PHMM is the number of k-bp sliding windows in a 3’ terminal exon. Fit the sequence with a two-state PHMM, i.e. Example of a two-state PHMM with parameters estimated by the Expectation and Maximization (EM) method. Transcripts with potential shortened 3’ UTRs (i.e. having 2 or more states) were selected based on the Bayesian information criterion (BIC) of the model fits as follows:
Transcripts with potential shortened 3’ UTRs (i.e. having 2 or more states) were selected based on the Bayesian information criterion (BIC) of the model

where BIC1 is the Bayesian information criterion from the 1-state model and BIC2 is the Bayesian information criterion from the 2-state model. If a 2-state model is preferred, select transcripts with transitions from high-expression state to the low-expression state (i.e.,3’UTR shortening).

Reference for Citing

Jun Lu and Pierre R. Bushel. Dynamic expression of 3’ UTRs revealed by Poisson hidden Markov modeling of RNA-Seq: implications in gene expression profiling. Gene 2013

Requirements

  • R version ≥ 2.15.0
  • GenomicFeatures, Rsamtools and GenomicRanges (packages from Bioconductor)
  • depmixS4  (package from CRAN)
  • Sorted bam file (and accompanying index file) with alignment of reads to a reference genome

Download the R Scripts

phmm.zip (7KB)

A Readme file is included to instruct on the execution of the scripts. Report bugs, corrections, and suggestions to Pierre Bushel at [email protected].

Public Domain Notice

This is U.S. government work. Under 17 U.S.C. 105 no copyright is claimed and it may be freely distributed and copied.

Contact

Pierre R. Bushel, Ph.D.
Special Volunteer
Tel 919-618-1945
[email protected]