Over-representation analysis (ORA) detects enrichment of genes within biological categories. Gene Ontology (GO) domains are commonly used for gene/gene-product annotation. When ORA is employed, there are often hundreds of statistically significant GO terms per gene set. Comparing enriched categories across a large number of analyses and identifying the term within the GO hierarchy with the most connections is challenging. Furthermore, inferring biological themes representative of the samples from the enriched categories can be highly subjective.
goSTAG was developed for utilizing GO Subtrees to Tag and Annotate Genes that are part of a set. Given gene lists from microarray, RNA sequencing (RNA-Seq) or other genomic high-throughput technologies, goSTAG performs GO enrichment analysis and clusters the GO terms based on the p-values from the significance tests. GO subtrees are constructed for each cluster, and the term that has the most paths to the root within the subtree is used to tag and annotate the cluster as the biological theme.
goSTAG is developed in R as a Bioconductor package and is available at https://bioconductor.org/packages/goSTAG.
Workflow for goSTAG
The workflow for goSTAG proceeds as follows:
First, gene lists are loaded from analyses performed within or outside of R. For convenience, a function is provided for loading gene lists generated outside of R. Then, GO terms are loaded from the biomaRt package. Users can specify a particular species (human, mouse, or rat) and a GO subontology (molecular function [MF], biological process [BP], or cellular component [CC]). GO terms that have fewer than the predefined number of genes associated with them are removed.
Next, GO enrichment is performed and p-values are calculated. Enriched GO terms are filtered by p-value or a method for multiple comparisons such as false discovery rate (FDR), with only the union of all significant GO terms remaining. An enrichment matrix is assembled from the –log10 p-values for these remaining GO terms.
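The enrichment step above can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not goSTAG's R implementation: the function names, the toy gene identifiers, and the choice of a raw p-value cutoff (rather than an FDR correction) are all assumptions made for illustration. It computes an upper-tail hypergeometric p-value for each GO term and gene list, keeps the union of terms significant in at least one list, and stores the -log10 p-values as the enrichment matrix.

```python
from math import comb, log10

def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k) for a hypergeometric draw.

    N: genes in the universe, K: genes annotated to the GO term,
    n: genes in the gene list, k: genes in both.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def enrichment_matrix(gene_lists, go_terms, universe, alpha=0.05):
    """Return {term: {list_name: -log10(p)}} for the union of significant terms."""
    universe = set(universe)
    pvals = {}
    for list_name, genes in gene_lists.items():
        genes = set(genes) & universe
        for term, members in go_terms.items():
            members = set(members) & universe
            k = len(genes & members)
            pvals[(term, list_name)] = hypergeom_pvalue(
                k, len(genes), len(members), len(universe)
            )
    # A term is kept if it is significant in at least one gene list
    # (raw p-value cutoff; an assumption, FDR would be an alternative).
    significant = sorted({term for (term, _), p in pvals.items() if p <= alpha})
    return {
        term: {name: -log10(pvals[(term, name)]) for name in gene_lists}
        for term in significant
    }

# Hypothetical toy data: 10 genes, 3 GO terms, 2 gene lists.
universe = {f"g{i}" for i in range(10)}
go_terms = {
    "term_A": {"g0", "g1", "g2"},
    "term_B": {"g5", "g6", "g7", "g8"},
    "term_C": {"g0", "g5", "g9"},
}
gene_lists = {"list1": ["g0", "g1", "g2"], "list2": ["g5", "g6", "g7"]}
matrix = enrichment_matrix(gene_lists, go_terms, universe)
```

In this toy example, term_C overlaps each list by only one gene and is dropped, while term_A and term_B each remain because they are significant in one of the two lists.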
goSTAG then performs hierarchical clustering on the matrix using a choice of distance/dissimilarity measures, grouping algorithms and matrix dimensions.
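As a rough illustration of this clustering step, the following Python sketch groups the rows of a small enrichment matrix. It is not the package's code: Euclidean distance, average linkage, the function name, and the toy values are assumptions chosen for brevity, whereas goSTAG offers a choice of measures and algorithms. Merging stops once the closest pair of clusters is farther apart than a threshold, which for average linkage gives the same groups as cutting the dendrogram at that height.

```python
from math import dist  # available in Python 3.8+

def average_linkage_clusters(rows, threshold):
    """Agglomerative clustering of matrix rows with average linkage.

    rows: {term: [enrichment values]}; threshold: maximum linkage
    distance at which two clusters are still merged.
    """
    clusters = [[name] for name in rows]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between members of a and b.
        return sum(dist(rows[x], rows[y]) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # remaining clusters are all farther apart than the cut
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical -log10 p-values for four terms across two gene lists.
rows = {
    "term_A": [5.0, 0.1],
    "term_B": [4.8, 0.2],
    "term_C": [0.1, 3.9],
    "term_D": [0.3, 4.2],
}
groups = average_linkage_clusters(rows, threshold=1.0)
```

Here the two terms enriched in the first list end up in one group and the two enriched in the second list in another.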
Then, based on clusters with a minimum number of GO terms, goSTAG builds a GO subtree for each cluster. The structure of the GO parent/child relationships is obtained from the GO.db package. The GO term with the largest number of paths to the root of the subtree is selected as the representative GO term for that cluster.
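One way to picture the selection of a representative term is the Python sketch below. It rests on several assumptions that the text above does not confirm: the subtree is taken to contain only the cluster's own terms and the parent links among them, a term with no parent inside the subtree counts as a root with a single trivial path, and ties are broken alphabetically. The function name and term labels are hypothetical.

```python
from functools import lru_cache

def representative_term(cluster_terms, parents):
    """Pick the cluster term with the most upward paths to a subtree root.

    parents: {term: [parent terms]} for the full ontology (a DAG).
    Returns (best_term, {term: path_count}).
    """
    members = set(cluster_terms)
    # Keep only parent links whose both ends are inside the cluster.
    sub_parents = {
        t: [p for p in parents.get(t, []) if p in members] for t in members
    }

    @lru_cache(maxsize=None)
    def paths_to_root(term):
        ups = sub_parents[term]
        if not ups:
            return 1  # a root of the subtree: one trivial path
        return sum(paths_to_root(p) for p in ups)

    counts = {t: paths_to_root(t) for t in members}
    # Iterating in sorted order makes max() break ties alphabetically.
    best = max(sorted(counts), key=counts.get)
    return best, counts

# Hypothetical diamond-shaped hierarchy: D has two routes up to R.
parents = {"D": ["A", "B"], "A": ["R"], "B": ["R"], "R": [], "X": ["outside"]}
best, counts = representative_term(["R", "A", "B", "D", "X"], parents)
```

In this example term D is selected because it reaches the root through both A and B, while X has no parent within the cluster and is treated as an isolated root.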
Finally, goSTAG creates a figure in the active graphic device of R that contains a heatmap representation of the enrichment and the hierarchical clustering dendrogram, with each cluster containing at least the predefined number of GO terms labeled with the name of its representative GO term.
Functions available in goSTAG
The goSTAG package contains seven functions:
- loadGeneLists: loads sets of gene symbols for over-representation analysis from a file in Gene Matrix Transposed (GMT) format or from text files in a directory
- loadGOTerms: provides the assignment of genes to GO terms
- performGOEnrichment: performs the ORA of the genes enriched within the GO categories and computes p-values for the significance based on a hypergeometric distribution
- performHierarchicalClustering: clusters the enrichment matrix
- groupClusters: partitions clusters of GO terms according to a distance/dissimilarity threshold of where to cut the dendrogram
- annotateClusters: creates subtrees from the GO terms in the clusters and labels the clusters according to the GO terms with the most paths back to the root
- plotHeatmap: generates a figure within the active graphic device illustrating the results of the clustering with the annotated labels and a heat map with colors representative of the extent of enrichment
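For readers unfamiliar with the GMT format mentioned for loadGeneLists, the short Python sketch below shows how such a file is typically laid out and read: one gene set per line, tab-separated, with the set name first, a description second, and the gene symbols after that. This is an illustration of the file format only, not the package's loader; the function name and the example gene sets are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into {set_name: [gene symbols]}."""
    gene_lists = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        # Drop empty fields left by trailing tabs.
        gene_lists[name] = [g for g in genes if g]
    return gene_lists

# Hypothetical two-line GMT content.
example = (
    "treated_up\tup-regulated genes\tTP53\tBRCA1\tMYC\n"
    "control_down\tna\tEGFR\tKRAS\n"
)
parsed = parse_gmt(example)
```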
Availability and requirements
Project Name: goSTAG
Project Homepage: The R Bioconductor package goSTAG is open source and available at https://bioconductor.org/packages/goSTAG
Operating System: Platform independent
Programming Language: R version ≥ 3.4.0
Citation for goSTAG
Bennett BD, Bushel PR. goSTAG: gene ontology subtrees to tag and annotate genes within a set. Source Code Biol Med. 2017 Apr 13;12:6. doi: 10.1186/s13029-017-0066-1. eCollection 2017. PubMed PMID: 28413437; PubMed Central PMCID: PMC5390446.
Report bugs, corrections and suggestions to Brian D. Bennett and Pierre R. Bushel
Brian D. Bennett, Ph.D.
Public Domain Notice
This is U.S. government work. Under 17 U.S.C. 105 no copyright is claimed and it may be freely distributed and copied.