Enhancing Integration, Interoperability, and Reuse of Data

Superfund Research Program

The NIEHS Superfund Research Program (SRP) hosted a Risk e-Learning webinar series focused on SRP-funded data science projects that are enhancing the integration, interoperability, and reuse of data. With these supplements, the SRP encourages data sharing among its grant recipients to accelerate scientific discoveries, stimulate new collaborations, and increase scientific transparency and rigor.

The series featured research by SRP grant recipients and colleagues collaborating to enhance the integration, interoperability, and reuse of SRP-generated data to generate new findings and answer questions that could not be answered before. We also heard from outside speakers from the U.S. Environmental Protection Agency (U.S. EPA), National Science Foundation (NSF), and the Global Alliance for Genomics and Health (GA4GH), who have complementary expertise in data sharing tools and initiatives.

Session I – Data Sharing Tools, Workflows, and Platforms

Monday, May 17, 2021, 1:00 PM-3:00 PM EDT

To view the archive, visit EPA's CLU-IN Training & Events webpage.

The first session introduced tools, strategies, workflows, and platforms developed by SRP researchers to organize existing data obtained from measuring contaminants in an array of environmental media to facilitate interoperability. These strategies were developed to enable researchers to reuse the data to better characterize and understand contaminants present in the environment. We heard about the U.S. Environmental Protection Agency’s (U.S. EPA’s) CompTox Chemicals Dashboard, a compilation of information from many sites and databases developed to organize chemical data and address data gaps.

Speakers:

William Suk, Ph.D., M.P.H., NIEHS Superfund Research Program
Brittany Saleeby, University of California, Davis
Benjamin Bostick, Ph.D., Columbia University, and Tracy Punshon, Ph.D., Dartmouth College
Antony Williams, Ph.D., U.S. EPA
Moderator: Michelle Heacock, Ph.D., NIEHS Superfund Research Program

SRP Director William Suk, Ph.D., M.P.H., provided an overview of the series and briefly discussed the rationale and goals of the SRP’s data science initiative.

Brittany Saleeby, trainee at the University of California, Davis (UC Davis) SRP Center, discussed a UC Davis and Duke University SRP center research collaboration focused on how disparate sources of high-resolution mass spectrometry data can be combined to harmonize approaches for non-targeted environmental analysis. Post-processing programs used by each lab for non-targeted mass spectrometry data resulted in dissimilar numbers of molecular features. The project’s current focus is describing the similarities and differences between analysis results and describing the chemical space that each data process is expected to recognize.
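
One way to picture how feature lists from two post-processing programs might be compared is the sketch below. It is an illustration only, not the UC Davis and Duke workflow: the function name, the tolerances, and the feature values are all hypothetical, and features are paired with a simple greedy first-match rule on mass-to-charge ratio (m/z) and retention time.

```python
def match_features(features_a, features_b, ppm_tol=5.0, rt_tol=0.2):
    """Pair (m/z, retention time) features from two lists.

    Two features are paired when their m/z values agree within ppm_tol
    and their retention times (minutes) agree within rt_tol. Matching is
    greedy: each feature in list A takes the first unused match in list B.
    """
    matched, used_a, used_b = [], set(), set()
    for i, (mz_a, rt_a) in enumerate(features_a):
        for j, (mz_b, rt_b) in enumerate(features_b):
            if j in used_b:
                continue
            ppm = abs(mz_a - mz_b) / mz_a * 1e6
            if ppm <= ppm_tol and abs(rt_a - rt_b) <= rt_tol:
                matched.append(((mz_a, rt_a), (mz_b, rt_b)))
                used_a.add(i)
                used_b.add(j)
                break
    only_a = [f for i, f in enumerate(features_a) if i not in used_a]
    only_b = [f for j, f in enumerate(features_b) if j not in used_b]
    return matched, only_a, only_b


# Made-up feature lists standing in for the output of two labs' software
lab_a = [(301.1410, 5.20), (455.2903, 8.10), (199.0970, 2.40)]
lab_b = [(301.1412, 5.25), (199.0975, 2.45), (610.1840, 9.90)]

matched, only_a, only_b = match_features(lab_a, lab_b)
print(len(matched), "shared;", len(only_a), "only in A;", len(only_b), "only in B")
```

The shared and unshared lists give a rough view of where two processing approaches agree and which features each one alone reports.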

Benjamin Bostick, Ph.D., of the Columbia University SRP Center, and Tracy Punshon, Ph.D., of the Dartmouth College SRP Center, described the development of a biological elemental imaging database that brings FAIR principles (Findable, Accessible, Interoperable and Reusable) to an archive of elemental maps collected from model plants (Arabidopsis and rice) at multiple synchrotron facilities in the U.S. over a 10-year period. They also used an example of integrated, interoperable datasets to find parallels in the chemical processes affecting drinking water quality across widely disparate communities in Bangladesh, Vietnam and the Northern Plains region of the U.S.

Antony Williams, Ph.D., of the Center for Computational Toxicology and Exposure in the Office of Research and Development at the U.S. EPA, focused on the CompTox Chemicals Dashboard, a website for environmental science data. This free website provides access to data for ~900,000 chemicals, including property data, in vivo and in vitro toxicity data, and exposure information, and it supports flexible searches for one chemical at a time or many thousands. The presentation provided a basic overview of the Dashboard, its capabilities, and how it can help environmental scientists quickly source relevant data.

Session II – Geospatial Platforms for Analysis and Visualization Across Environmental Data

Thursday, June 3, 2021, 2:00 PM-4:00 PM EDT

To view the archive, visit EPA's CLU-IN Training & Events webpage.

In the second session, presenters described efforts to combine and analyze data sets from SRP Centers and other sources using geospatial platforms. This session also featured a speaker supported by NSF who discussed HydroShare, an online system for sharing hydrologic data and models.

Speakers:

Pianpian Wu, Ph.D., Dartmouth College, and Caredwen Foley, M.P.H., Boston University
Andrew Creamer, Brown University
David Tarboton, Sc.D., Utah State University (Supported by NSF)
Moderator: Leslie Hsu, Ph.D., United States Geological Survey

Pianpian Wu, Ph.D., postdoctoral researcher at Dartmouth College, and Caredwen Foley, M.P.H., research assistant at the Boston University SRP Center, discussed an examination of fish consumption advisories. To date, fish consumption advisories have been established for single contaminants, including mercury, chlordane, dichlorodiphenyltrichloroethane (DDT), dioxins, and polychlorinated biphenyls (PCBs). Surprisingly, the co-occurrence of multiple contaminants in fish tissue from water bodies under advisory has not been systematically examined. Their work addresses the spatial differences in multi-chemical co-exposures from consuming fish and highlights the need to revisit how fish consumption advisories are established, so that they reflect exposure to chemical mixtures and address the totality of risks and benefits of fish consumption.

Andrew Creamer, research data management librarian at Brown University, described opportunities for partnering with libraries to support public access to research data and data reuse. He focused on lessons learned from collecting and curating historical industrial land use data, the potential applications of these data in analysis and visualization, and the challenges of curating publicly available datasets for reuse, drawing on the project Developing a Spatial Approach for Toxic Transferal from Industrial and Vacant Land Uses to Green Infrastructure.

David Tarboton, Sc.D., Director of the Utah Water Research Laboratory and Professor of Civil and Environmental Engineering at Utah State University, introduced HydroShare, a repository developed for sharing data and models within the hydrology and water resources community served by CUAHSI. The presentation described HydroShare functionality for capturing and holding metadata as well as tools for acting on data in HydroShare that make data sharing attractive beyond open data mandates and enable problem solving through data integration. He discussed lessons learned and the challenges that remain in the management and reuse of water-related data for integrated problem solving.

Session III – Integrating Omics Data Across Model Organisms and Populations

Tuesday, August 3, 2021, 2:00 PM-4:00 PM EDT

To view the archive, visit EPA's CLU-IN Training & Events webpage.

The third and final session featured SRP-funded researchers collaborating to combine omics (e.g., genomics, proteomics) data within and across model organisms as well as studies in human populations. We also heard from the Global Alliance for Genomics and Health about their work to incorporate semantic data models for sharing genomic data in ways that align with environmental health research.

Speakers:

Monica Munoz-Torres, Ph.D., Anne Thessen, Ph.D., and Melissa Haendel, Ph.D., University of Colorado, Anschutz Medical Campus
Mark Hahn, Ph.D., Woods Hole Oceanographic Institution and Boston University, and Adam Labadorf, Ph.D., Boston University
Christian Powell, University of Kentucky
Andres Cardenas, Ph.D., and Anne Bozack, Ph.D., University of California, Berkeley
Moderator: Stephanie Holmgren, NIEHS Office of Data Science

Monica Munoz-Torres, Ph.D., Anne Thessen, Ph.D., and Melissa Haendel, Ph.D., presented on the Global Alliance for Genomics and Health and discussed harmonizing and sharing phenotypes across organisms for diagnostics and mechanism discovery. Describing phenotypes in a way that is as computable as sequences has been a major challenge in biomedicine. Within the Global Alliance for Genomics and Health, members of the Monarch Initiative have led the creation of the Phenopacket standard. Phenopackets are computable phenotypic profiles encoded using ontologies, which can be used to relate a patient’s phenotypic profile to model organisms and thereby improve diagnostics, mechanism discovery, and integration with environmental health data. Melissa Haendel, Ph.D., is the Chief Research Informatics Officer at the University of Colorado Anschutz Medical Campus, the director of the Center for Data to Health (CD2H), and an elected Fellow of the American Medical Informatics Association. Monica Munoz-Torres, Ph.D., is a Program Manager in the Translational and Integrative Sciences Lab (TISLab). Anne Thessen, Ph.D., is a Semantics Engineer in the TISLab.

Mark Hahn, Ph.D., Senior Scientist at Woods Hole Oceanographic Institution and researcher with the Boston University SRP Center, and Adam Labadorf, Ph.D., Assistant Professor at Boston University, described the integration of diverse genome-scale data sets to obtain a better understanding of how genetic variation influences sensitivity and resistance to hazardous chemicals. The work builds on studies of the genetic mechanisms underlying evolved resistance to polychlorinated biphenyls and polycyclic aromatic hydrocarbons in Atlantic killifish (Fundulus heteroclitus) populations living at Superfund sites. They also described a new JBrowse instance that integrates multiple genomic data sets, facilitating comparison of genetic variation among killifish populations, and discussed the potential to expand this platform for cross-species comparisons (e.g., zebrafish, human).

Christian Powell, graduate student at the University of Kentucky SRP Center, focused on the Metabolomics Workbench, a public scientific data repository consisting of experimental data and metadata from metabolomics studies collected from mass spectrometry and nuclear magnetic resonance analyses. To keep up with the ever-evolving state of the Metabolomics Workbench repository, the open source mwtab Python package has been updated to mirror the changes in the mwTab file format and now contains enhanced file validation features, methods for utilizing the Metabolomics Workbench REpresentational State Transfer (REST) interface, and additional features for parsing metabolite data and metadata from repository entries. All of these resources are available through an updated Command Line Interface (CLI) and as a Python application programming interface (API). The mwtab package aims to promote the most FAIR utilization of the Metabolomics Workbench repository and continues to evolve alongside it.
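
To make the idea of a REST interface concrete, the sketch below assembles a request URL for the Metabolomics Workbench. It does not use the mwtab package and makes no network call. The helper name is hypothetical, the analysis identifier is only illustrative, and the path layout (context, input item, input value, output item, output format) reflects our understanding of the service's URL scheme and should be checked against the current Metabolomics Workbench REST documentation.

```python
from urllib.parse import quote

# Assumed base address of the Metabolomics Workbench REST service
BASE_URL = "https://www.metabolomicsworkbench.org/rest"


def build_rest_url(context, input_item, input_value, output_item, output_format=None):
    """Assemble a REST URL from its path components (hypothetical helper).

    Each component is percent-encoded so that unusual characters cannot
    break the path; output_format is optional.
    """
    parts = [context, input_item, input_value, output_item]
    if output_format:
        parts.append(output_format)
    return BASE_URL + "/" + "/".join(quote(str(p), safe="") for p in parts)


# Illustrative request for one analysis in mwTab text format
url = build_rest_url("study", "analysis_id", "AN000001", "mwtab", "txt")
print(url)
```

A URL built this way could then be fetched with any HTTP client, or the same request could be made through the REST methods that the mwtab package provides.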

Andres Cardenas, Ph.D., Assistant Professor of Environmental Health Sciences at the University of California, Berkeley SRP Center, and Anne Bozack, Ph.D., postdoctoral fellow in the Cardenas Lab at the University of California, Berkeley SRP Center, discussed Arsenic Epigenetics META, a meta-analysis of epigenome data on arsenic. Epigenome-wide association studies (EWAS) of environmental exposures are commonly restricted to single populations with small sample sizes, and comparison across EWAS has been limited by methodological differences. To address these limitations, they developed a two-step process of (1) harmonized data processing and analysis and (2) meta-analysis to combine results across EWAS. The researchers leveraged data from epidemiological studies of arsenic-exposed populations in Chile and Bangladesh, including DNA methylation measured in different tissue types (i.e., peripheral blood mononuclear cells (PBMCs) and buccal cells) and using different platforms (i.e., the 850K and 450K microarrays), to identify arsenic-related DNA methylation signatures.
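
The meta-analysis step can be illustrated with a short sketch. The summary above does not say which meta-analysis model the researchers used, so the fixed-effect inverse-variance method shown here is an assumption chosen for simplicity, and the function name and numbers are made up. For a single DNA methylation site, each study contributes an effect estimate and a standard error, and more precise studies receive more weight.

```python
import math


def inverse_variance_meta(estimates, std_errors):
    """Pool per-study effect estimates with fixed-effect inverse-variance weights.

    Each study is weighted by 1 / SE^2. Returns the pooled estimate and
    its standard error.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Made-up effect estimates for one methylation site in two hypothetical cohorts
pooled, pooled_se = inverse_variance_meta([0.02, 0.04], [0.01, 0.01])
print(f"pooled estimate = {pooled:.4f}, standard error = {pooled_se:.4f}")
```

Pooling of this kind is only meaningful when the per-study estimates are on a comparable scale, which is what the harmonized processing in step (1) is meant to ensure.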

National Institute of Environmental Health Sciences
