Researchers sought to create tools, platforms, and strategies to make data more findable, accessible, interoperable, and reusable. For definitions of common data science and sharing terms, see the glossary on the landing page. For more information about these use cases, please refer to the White Paper (2MB).
Improving the Robustness and Toxicological Significance of Nontarget Chemical Identification in High Resolution Mass Spectrometric Data
Collaborating Institutions:
Duke University and University of California (UC) Davis SRP Centers, and the U.S. Environmental Protection Agency (EPA)
Abstract:
Superfund sites contain complex chemical mixtures, where mixture components may be known or unknown. Samples of these mixtures sometimes lack data on methods of detection, environmental occurrence, or toxicity. High-resolution mass spectrometry (HRMS) allows researchers to look for and measure multiple unknown chemicals, an approach called non-targeted analysis.
Researchers at the Duke University and UC Davis SRP Centers worked with an external collaborator from the EPA to explore how to harmonize and combine sources of high-resolution mass spectrometry data to improve non-targeted analysis of complex mixtures of environmental contaminants. They aimed to develop methods with open-source data analysis software by performing an intercomparison study, in which the two centers exchanged sample spectra and compared the results of their non-targeted analyses. They also sought to develop new approaches to link toxicity data with non-targeted screening results.
Research Question:
How can disparate sources of high-resolution mass spectrometry data be combined to harmonize approaches for non-targeted environmental analysis?
Data Sources and Data Sets:
Source | Dataset | Data metrics | Format | Analysis Instrument |
Duke SRP Center | Non-targeted analysis results from ESI (+) HRAM MS/MS analysis of North Carolina river water | Data metrics: 36 water samples measured (72 LC | .RAW | Thermo Orbitrap instrument platforms |
UC Davis SRP Center | Non-targeted analysis results from ESI (-) HRAM MS/MS analysis of Yurok Tribe Sediment | Data metrics: 38 sediment samples measured (76 LC MS/MS analyses, ESI+/ESI-), representing 215 GB of mass spectral data | .D | Agilent instrument platforms |
HRMS data from spectral libraries | ||||
Data Repositories:
Approach:
The research team converted raw data obtained using proprietary software into an open-source format and developed protocols and algorithms for compound identification and annotation, along with consistent performance metrics. The team uploaded data to open-source data sharing software to share with collaborators, and wrote code to combine and translate different open-source spectral libraries into one library.
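The library-merging step could look roughly like the sketch below. It is an illustration only, not the team's code: the `LibraryEntry` fields, the choice of an InChIKey-style identifier as the merge key, the base-peak normalization, and all sample values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LibraryEntry:
    """One reference spectrum (hypothetical schema for this sketch)."""
    inchikey: str        # compound identifier used as the merge key (assumption)
    name: str
    precursor_mz: float
    peaks: list          # list of (m/z, intensity) tuples
    source: str          # which library the entry came from


def normalize_peaks(peaks):
    """Scale intensities so the base peak is 100, then sort by m/z."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    if base <= 0:
        return []
    return sorted(
        (round(mz, 4), round(100.0 * intensity / base, 2))
        for mz, intensity in peaks
    )


def merge_libraries(*libraries):
    """Group entries from several libraries under a cleaned-up identifier."""
    merged = {}
    for library in libraries:
        for entry in library:
            key = entry.inchikey.strip().upper()
            merged.setdefault(key, []).append(
                LibraryEntry(
                    inchikey=key,
                    name=entry.name.strip(),
                    precursor_mz=entry.precursor_mz,
                    peaks=normalize_peaks(entry.peaks),
                    source=entry.source,
                )
            )
    return merged


# Made-up entries: the identifier and peak values are placeholders.
library_a = [LibraryEntry(" example-key-0001 ", "Compound A", 195.0877,
                          [(195.0877, 1000.0), (138.0662, 500.0)], "library_a")]
library_b = [LibraryEntry("EXAMPLE-KEY-0001", "compound a", 195.0876,
                          [(195.0876, 20.0)], "library_b")]
combined = merge_libraries(library_a, library_b)
```

Keeping every source spectrum under one key, rather than overwriting, preserves provenance so that disagreements between libraries remain visible.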
Outcomes:
The researchers created an open-source and fully accessible library to store mass spectrometry data that combines and harmonizes data from several online chemical repositories. The team developed standard formats for compound annotation, data sharing, and data storage, and established a framework to collect new biosensor data that will improve data interoperability and reuse. They also facilitated access to higher-resolution analytical data to match against unknown compounds and environmental samples.
Integrated Datasets, Portals/Dashboards, Tools, Code:
Code used to harmonize libraries will be made available at https://github.com/fergusonlabduke upon publication.
Improvement of Small Molecule Biosensor Probe Development and Biomedical Applications Through the Integration and Reuse of SRC Data Sets
Collaborating Institutions:
UC Davis and UC San Diego SRP Centers
Abstract:
Most analytical methods used to detect toxicants and biomarkers of exposure are expensive and complex to use. To address this challenge, UC San Diego researchers develop protein biosensors for rapid on-site detection of environmental contaminants such as arsenic, cadmium, and organochlorides. UC Davis researchers design probes to measure specific chemicals and biomarkers of exposure. These tools can be useful for detecting contaminants on-site and for studying human exposure and disease.
The team explored how to re-purpose and integrate their existing datasets to improve biosensor probes used to detect and quantify pollutants in the environment or humans. Through this collaboration, the researchers sought to improve the specificity of their probes and improve their ability to detect new target chemicals in the environment.
Research Question:
How can we re-purpose and integrate our existing probe data to improve the specificity of probes and improve their design to detect new target chemicals in the environment?
Data Sources and Data Sets:
Nanobody probe sequences from UC Davis and synthetically evolved receptor and biosensor probes from UC San Diego (see table below).
Institution | Compound | Probe Type | Sample Type | Associated Publications |
UC Davis | TCC | Nanobody | N | |
UC Davis | 3-PBA | Nanobody | human urine, environmental and food samples | |
UC Davis | BDE-47 | Nanobody | furniture samples | |
UC Davis | TBBPA | Nanobody | spiked soil and serum | |
UC Davis | sEH | Nanobody | human samples in process | |
UC Davis | Ochratoxin A | Nanobody | cereal | |
UC Davis | Carbaryl | Nanobody | spiked cereal | |
UC Davis | Triazophos | Nanobody | spiked water, soil, apple | |
UC Davis | Fipronil | Nanobody | serum from exposed animals | |
UC Davis | CIF | Nanobody | human samples in process | |
UC Davis | 2,4-D | Nanobody | environmental samples | In Progress |
UC Davis | CNAP | Nanobody | environmental samples | In Progress |
UC Davis | TETS | Nanobody | exposed animals | In Progress |
UC San Diego | PBDE-100 | Synthetically Evolved Receptors and Biosensors | environmental samples | In Progress |
UC San Diego | Triclosan | Synthetically Evolved Receptors and Biosensors | N | In Progress |
UC San Diego | Phenanthrene | Synthetically Evolved Receptors and Biosensors | N | In Progress |
UC San Diego | Polychlorinated biphenyl | Synthetically Evolved Receptors and Biosensors | N | In Progress |
UC San Diego | TCDD | Synthetically Evolved Receptors and Biosensors | N | In Progress |
UC San Diego | PCB-Aroclor | Synthetically Evolved Receptors and Biosensors | N | In Progress |
Data Repositories:
UC San Diego Data Management Portal
Approach:
The research team developed rules for data entry, sharing, formatting, and vocabulary for datasets. They entered data from notebooks into an online database, reformatted their probe sequence data to be interoperable, and established a framework for data storage. The team also standardized their probe sequence data so that it could be integrated with data from collaborators. In virtual discussions, they reviewed best practices for data standardization and sharing, and agreed on rules and protocols for future data collection and entry.
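As a rough illustration of what data entry rules and a shared vocabulary can mean in practice, the sketch below standardizes one probe record. The field names, the restriction to the 20 standard amino acids, and the sample sequence fragment are assumptions for the example. Only the two probe type labels come from the table above.

```python
# The 20 standard amino acid one-letter codes (assumed alphabet for probe sequences).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Controlled vocabulary, using the probe type labels from the table above.
PROBE_TYPES = {
    "nanobody",
    "synthetically evolved receptors and biosensors",
}


def standardize_probe(record):
    """Return a cleaned copy of a probe record, or raise ValueError."""
    # Remove all whitespace and line breaks, then upper-case the sequence.
    sequence = "".join(record["sequence"].split()).upper()
    invalid = sorted(set(sequence) - AMINO_ACIDS)
    if not sequence or invalid:
        raise ValueError(f"invalid sequence characters: {invalid}")

    probe_type = record["probe_type"].strip().lower()
    if probe_type not in PROBE_TYPES:
        raise ValueError(f"unknown probe type: {probe_type!r}")

    return {
        "institution": record["institution"].strip(),
        "compound": record["compound"].strip(),
        "probe_type": probe_type,
        "sequence": sequence,
    }


# Hypothetical notebook entry; the sequence fragment is made up.
raw = {
    "institution": " UC Davis ",
    "compound": "3-PBA",
    "probe_type": "Nanobody",
    "sequence": "qvql vesg\nGG",
}
clean = standardize_probe(raw)
```

Rejecting a record at entry time, instead of silently repairing it, keeps the shared database consistent across contributing labs.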
Outcomes:
The team established a streamlined process to standardize data and developed foundational infrastructure for sharing probe sequences and functional data. They also established methodologies, protocols, architecture, and standards to be used prior to data collection.
Improvement, Harmonization, and Merging of Data Streams Related to DNA Damage
Collaborating Institutions:
Massachusetts Institute of Technology (MIT) and University of New Mexico (UNM) SRP Centers
Abstract:
About one-third of Superfund sites are contaminated with known DNA-damaging contaminants. DNA damage can lead to mutations that result in disease. Researchers at the MIT SRP Center developed a high-throughput technology called CometChip, which measures the movement of DNA under electric current to quantify the level of DNA damage. This technology has been widely adopted by researchers, including researchers at the UNM SRP Center, who aim to use the tool to assess DNA damage from metal exposures and explore whether zinc supplementation reduces DNA damage.
The MIT and UNM SRP Centers worked together to harmonize existing data analysis approaches and share data generated from CometChip to better understand how environmental contaminants, such as metals, can modulate DNA damage and repair. The team aimed to merge existing datasets from human and animal models to understand how levels of DNA damage in humans compare to mouse models. They also wanted to understand how CometChip data relate to absolute levels of DNA lesions in human tissues, which could reveal new insight into environmentally induced DNA damage in humans.
Research Question:
How can data generated from the CometChip be used to better understand how environmental contaminants, such as metals, can modulate DNA damage and repair? How does CometChip data relate to absolute levels of DNA lesions in human tissues?
Data Sources and Data Sets:
Institution | Dataset | Description | Variables |
UNM | Navajo Birth Cohort Study: 253 individuals (202 pregnant women and 51 men) | > 10,000 cell cluster images from CometChip DNA damage analysis > Controlled exposure of human cells to DNA oxidation damage using hydrogen peroxide | DNA damage levels |
MIT | CometChip data collected under the auspices of the MIT SRP; in vitro studies using mammalian cells | > 1 million cell cluster images from CometChip DNA damage analysis > standard curves for gamma irradiation to estimate percent comet tail for specific levels of DNA damage | DNA damage levels |
Data Repositories:
No existing repositories available for CometChip data.
Metadata:
Metadata Standards Utilized:
Used existing parameters developed by MIT.
Vocabularies Utilized:
No ontologies or existing vocabularies for CometChip data.
Approach:
The researchers set out to establish parameters and standards for Comet images so that they can be compatible with basic microscopes, establish an open-source image processing and analysis platform, and develop statistical approaches to optimize image analysis. To help users calibrate their data and facilitate data interpretation, the team used their image data to generate a standard curve that compares cell irradiation to DNA strand breaks. They also customized an existing data sharing and management platform, called SEEK, to create a repository to share comet metadata.
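A standard curve of this kind can be sketched as a least-squares line fitted to calibration points and then inverted to read an unknown sample. The example below assumes a linear relationship between radiation dose and percent tail DNA over the calibrated range. That assumption, the function names, and the calibration values are illustrative and do not come from the team's data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired calibration points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("calibration doses must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def dose_equivalent(percent_tail, slope, intercept):
    """Invert the standard curve: percent tail DNA -> equivalent dose."""
    if slope == 0:
        raise ValueError("a flat curve cannot be inverted")
    return (percent_tail - intercept) / slope


# Made-up calibration points: dose in Gy vs. mean percent tail DNA.
doses_gy = [0.0, 2.0, 4.0, 6.0, 8.0]
percent_tail = [5.0, 15.0, 25.0, 35.0, 45.0]

slope, intercept = fit_line(doses_gy, percent_tail)
unknown_sample_dose = dose_equivalent(30.0, slope, intercept)
```

Turning an equivalent dose into a count of strand breaks would need a separate calibration factor, which is not given here and is left out of the sketch.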
Outcomes:
The team successfully combined datasets from cluster images created at MIT and UNM. Using these data, they generated standard curves that enable researchers to convert image data into quantified DNA strand breaks, which helps to estimate absolute levels of DNA damage. Leveraging an existing data sharing platform, they created MIT Seek, a platform for sharing Comet metadata between the two centers. They also adapted their existing analysis software to format data automatically and to export data analytics, including metadata, into Excel.
They also developed ontologies for comet data, and are working with researchers outside of their EUC, who are currently using the CometChip, to finalize these ontologies.
Integrated Datasets, Portals/Dashboards, Tools, Code:
Once the results from this analysis have been published, the team plans to make their metadata publicly available by transferring it into FAIRDOM Hub, an open-access platform. They will also share the raw data using an open-access hub hosted by MIT.
Development of Interoperable Data Platforms to Define Best Practices and Data Sharing for Flow Cytometry
Collaborating Institutions:
UNM and University of Louisville (U of L) SRP Centers
Abstract:
Flow cytometry, a technique used to detect and measure characteristics of cells, allows unprecedented detail in studies of the immune system and other areas of cell biology. Tens of thousands of cells can be quickly examined, and the data gathered are processed by a computer. However, this type of data can be complex to analyze and harmonize.
Collaborators at the UNM and U of L SRP Centers set out to develop a platform to store and share diverse datasets obtained by flow cytometry. By integrating existing datasets, the team aimed to better understand the effects of chemical exposures on circulating blood cells that cause immune injury or cardiovascular disease. Integrating existing datasets will also allow the researchers to apply flow data more broadly to understand whether animal models can predict human immune responses to environmental exposures.
Research Question:
Can changes in flow data markers be linked to environmental exposures? Does flow data give insights into mechanisms of toxicity? Can new biomarkers be discovered? Can flow data be used to support environmental projects to guide an understanding of needed remediation efforts? Do results in different mouse species predict human responses?
Data Sources and Data Sets:
Institution | Organism | Dataset | Data type |
UNM | Mouse | > 100 mice for evaluation of cell surface marker expression and subset analysis (T, B, NK, Mɸ, erythroid markers from bone marrow, spleen, and thymus) for evaluation of potential mechanisms for immunotoxicity of uranium and arsenic. | Flow cytometry |
U of L | Human | 316 subjects assessed for 15 types of circulating angiogenic cells and platelet aggregates to determine the impact of exposure to volatile organic compounds on cardiovascular disease. | Flow cytometry |
Data Repositories:
The Environmental Data Initiative (EDI) Data Portal
Metadata:
Metadata Standards Utilized:
Used existing parameters from data collection from individual experiments.
Vocabularies Utilized:
They looked at the Open Biological and Biomedical Ontology Foundry website to identify a potential ontology for flow data.
Approach:
The team developed a template for sharing flow metadata based on a MIFlowCyt template. Their goal was to produce a structured form that can easily be incorporated into the data collection process and facilitate the integration of data. They then completed these forms and uploaded them into a test portal on the EDI Staging Environment to share with each other and facilitate data analysis. They looked at the Open Biological and Biomedical Ontology Foundry website to identify a potential ontology for flow data.
Outcomes:
The team created the template for flow cytometry metadata, which describes parameters such as the type of instrumentation used to collect data, instrumentation voltage, how data were analyzed, specific nomenclature, and file names. They uploaded this template into the public UNM portal, completed the form for their individual datasets, and uploaded their raw datasets to share and facilitate data interpretation and analysis. The researchers successfully accessed and analyzed each other’s data.
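A structured metadata form like this can be checked automatically before upload. The sketch below is a hypothetical completeness check: the field names are invented stand-ins for the parameters listed above (instrumentation, voltage, analysis method, nomenclature, file names) and are not the team's actual template or the MIFlowCyt field names.

```python
# Hypothetical field names standing in for the template parameters described above.
REQUIRED_FIELDS = (
    "instrument_type",
    "detector_voltages",
    "analysis_method",
    "marker_nomenclature",
    "file_names",
)


def missing_fields(record):
    """List required fields that are absent or left empty in a metadata record."""
    return [
        name
        for name in REQUIRED_FIELDS
        if record.get(name) in (None, "", [], {})
    ]


# Made-up example record with two fields left blank.
draft = {
    "instrument_type": "example cytometer",
    "detector_voltages": {},
    "analysis_method": "manual gating",
    "marker_nomenclature": "CD markers",
    "file_names": [],
}
problems = missing_fields(draft)
```

Running a check like this at entry time is one way to keep forms from different labs comparable before they are shared.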
They also established the Cell Ontology as a starting point to develop an ontology for flow cytometry data. They identified that FlowCL, a software package that performs semantic labelling of cell populations, could further expand the ontology and help link it to specific cell populations.
Integrated Datasets, Portals/Dashboards, Tools, Code:
Completed forms, metadata, and templates were uploaded into a test portal on the EDI Staging Environment. UNM is creating a website for easy access to this portal. At U of L, the team plans to create their own portal and website.
Improving Synchrotron-Based Data Access, Analysis and Workflows: Measuring the Concentration, Speciation and Distribution of Contaminants in Environmental and Biomedical Matrices
Collaborating Institutions:
Columbia University, University of Arizona (UA), Dartmouth College, and UNM SRP Centers
Abstract:
Researchers collaborated to explore how synchrotron-based spectroscopic data from their SRP Centers can be combined to better understand chemical speciation and the environmental and biochemical factors that control chemical form, retention, transport, and distribution of contaminants in diverse samples.
Synchrotrons use electrons accelerated to near light speed and steered by magnets to create beams of light that cause the chemical elements within a sample to fluoresce. Synchrotron-based analysis provides elemental abundance, distribution, and speciation data and is a highly sought-after technique. While researchers are usually only interested in data for a few specific elements in elemental mapping, the synchrotron collects a full spectrum, providing information on a broad range of elements. Data for elements outside of the investigator’s initial hypothesis are never re-used, despite their potential value. Archiving this wealth of data would allow more researchers to leverage valuable data that is challenging to obtain.
By developing a series of spectral databases and verifiable and traceable reference materials with appropriate metadata, the team aimed to automate, integrate, and improve synchrotron data analysis to better quantify the distribution of contaminant species in environmental samples and within biological tissues.
Research Question:
How can we use synchrotron-based spectroscopic data to understand chemical speciation, and thus fate, transport, and toxicity of environmental contaminants? Can we automate, integrate, and improve synchrotron data analysis?
Data Sources and Data Sets:
Data from individual researchers at SRP Centers, including environmental spectra, element-specific images and spectra, reference materials, spectra from experiments, and space and time markers.
Collaborator | Project | # of Samples | # of References | Example Variables |
Columbia | As remediation of NPL sites (2007-present) | >200 | >200 | As, Fe, Mn references characteristic of neutral pH environments and aquifers, NPL-site characterization |
Columbia | As mitigation in Bangladesh (2006-present) | >1000 | >200 | As, Fe, and Mn speciation in natural environmental sediments/soils |
UA | Phytoremediation of mine tailings (2005-present) | >1000 | >200 | As, Pb, U, Zn, Fe, Mn, and S reference compounds characteristic of sulfide ore and mine-impacted soils/sediments/plant tissues |
Dartmouth | Trace Core (2006-2017) | >500 | >100 | Synchrotron and elemental imaging data spanning animal and human tissue specimens, related laser ablation ICP-MS |
UNM | As/U immobilization (2011-present) | >100 | >100 | U, As, V, Fe and Mn reference spectra for natural and synthetic minerals and mineral mixtures typical of mining impacted areas |
Data Repositories:
U.S. Geological Survey, GitHub
Metadata:
Vocabularies Utilized:
No ontologies available for synchrotron data.
Approach:
The research team performed partial processing exercises, linking metadata for different types of experiments to the data, environmental samples, and reference materials. They developed small pieces of code to allow a single software package to process diverse data while also implementing a universal data format. They developed standard operating procedures for data collection and processing and used the Sam's MicroAnalysis toolKit (SMAK) program to facilitate analyses across imaging data.
Outcomes:
They created user-friendly web-based workflows within a storage system, called the Biological Elemental Imaging Database (BEID), for microprobe analysis and laser ablation inductively coupled plasma mass spectrometry (ICP-MS) data, along with a related website and interoperability widget. While the database is not currently publicly available, BEID will ensure uniformity in data analysis, allow users to upload directly to the database, and conduct automated quality checks on submitted data.
The team integrated data on reference standards for iron and arsenic. Furthermore, they worked to create unsupervised spectral analysis tools that integrate statistical clustering with important environmental outputs like drinking water quality. The team also uncovered new information on factors that control arsenic levels and toxicity in rice, such as levels of iron and zinc.
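To give a sense of what unsupervised clustering of spectra involves, here is a minimal k-means sketch in plain Python. The source does not say which clustering method the team used, so k-means is a stand-in chosen for simplicity. The toy spectra, the fixed iteration count, and the choice of the first `k` spectra as starting centroids are all assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(spectra, k, iterations=20):
    """Group spectra into k clusters; returns (labels, centroids)."""
    if not 1 <= k <= len(spectra):
        raise ValueError("k must be between 1 and the number of spectra")
    # Deterministic start: the first k spectra serve as initial centroids.
    centroids = [list(s) for s in spectra[:k]]
    labels = [0] * len(spectra)
    for _ in range(iterations):
        # Assignment step: each spectrum joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            for s in spectra
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(spectra, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy spectra on a shared three-point energy grid (values are invented).
toy_spectra = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0],
    [0.1, 0.0, 0.9],
]
labels, centroids = kmeans(toy_spectra, k=2)
```

Real spectra would first need to be interpolated onto a common energy grid and normalized. Cluster labels could then be compared against outputs such as water quality measurements.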
Publications:
Nghiem AA, Shen Y, Stahl M, Sun J, Haque E, DeYoung B, et al. 2020. Aquifer-scale observations of iron redox transformations in arsenic-impacted environments to predict future contamination. Environ Sci Technol Lett 7(12):916-922.