Leveraging Data Across Human Populations

Researchers combined data from human populations to better understand the connections between exposure to hazardous substances and health. For definitions of common data science and sharing terms, see the glossary on the landing page. For more information about these use cases, please refer to the White Paper.

Arsenic Mass Balance: Integrating Environmental and Biomarker Data Across Diverse Populations

Collaborating Institutions:

Columbia University, the University of California (UC) Berkeley, and the University of New Mexico (UNM) SRP Centers

Abstract:

Arsenic, found naturally in Earth’s crust, is known to cause a variety of health problems in humans. Health risk estimates are currently based on drinking water exposure, but depending on the location, other sources are also relevant, including food and potentially dust and air (for example, in regions where inorganic arsenic is a common component of mining waste).

A team of researchers from the Columbia University, UC Berkeley, and UNM SRP Centers worked together to understand how comparing exposure and excretion across populations can create a more complete picture of potential sources of arsenic. These centers explore the effects of arsenic from different sources in populations in Bangladesh, Chile, and the Navajo Nation in the U.S., respectively. The Use Case sought to collectively analyze arsenic measurements in biological samples, like urine, and environmental samples, like water and dust.

Research Question:

Can arsenic measured in environmental and human samples be combined to get a more complete picture of exposure levels?

Data Sources and Data Sets:

Institution	Study population	Urine samples (n)	Water samples (n)	Dust samples (n)	Air samples (n)
Columbia	11,224 adults from Bangladesh	11,226	11,751	-	-
UC Berkeley	630 adults from Northern Chile	610	610	-	585
UNM	629 adults from Navajo Nation	619	619	619	-

Metadata:

Metadata Standards Utilized:

The team leveraged Ecological Metadata Language to standardize their metadata.

Vocabularies Utilized:

The team leveraged the Medical Subject Headings (MeSH) to standardize vocabulary.

Approach:

The team first created a searchable data dictionary, which described common project-specific terminology including specific names, definitions, and attributes about data elements. Each study center developed their own data dictionary and then integrated the information into one large ontology. They utilized Medical Subject Headings (MeSH) to standardize vocabulary.
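As a rough illustration of what a searchable, merged data dictionary might look like, the sketch below combines per-center entries and searches them by keyword. The variable names, definitions, units, and the `DictionaryEntry` structure are hypothetical and are not taken from the team's actual dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """One data element as described by a single center (hypothetical fields)."""
    name: str        # shared variable name
    definition: str  # human-readable description
    unit: str        # measurement unit
    center: str      # center that contributed the entry


def merge_dictionaries(*dictionaries):
    """Combine per-center entry lists into one mapping keyed by variable name."""
    merged = {}
    for entries in dictionaries:
        for entry in entries:
            merged.setdefault(entry.name, []).append(entry)
    return merged


def search(merged, term):
    """Return sorted variable names whose name or definition contains the term."""
    term = term.lower()
    return sorted(
        name
        for name, entries in merged.items()
        if term in name.lower()
        or any(term in e.definition.lower() for e in entries)
    )


# Illustrative entries only; these are assumptions, not the real dictionaries.
columbia = [DictionaryEntry("urine_as", "Total arsenic in urine", "ug/L", "Columbia")]
berkeley = [
    DictionaryEntry("urine_as", "Urinary arsenic concentration", "ug/L", "UC Berkeley"),
    DictionaryEntry("air_as", "Arsenic in air samples", "ug/m3", "UC Berkeley"),
]
unm = [DictionaryEntry("dust_as", "Arsenic in household dust", "mg/kg", "UNM")]

merged = merge_dictionaries(columbia, berkeley, unm)
print(search(merged, "urin"))
```

Keeping every center's entry under the shared name, rather than overwriting, makes differences in definitions or units visible during harmonization calls.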

They held biweekly calls to harmonize data following their data dictionary and shared data via Google Drive for analysis. They exported visualizations from their analysis as Scalable Vector Graphics (SVG) files, which allowed them to adjust and scale the figures while maintaining quality and integrity.

Outcomes:

The team was successful in harmonizing and combining their data to evaluate the relationship between environmental arsenic and urinary arsenic and to implement a mass balance approach. The mass balance model follows the principle that the amount of arsenic entering the body should equal the amount that exits the body, in this case quantified from urinary arsenic. By combining data from the three populations, the researchers were able to bring arsenic intake and excretion closer to mass balance by considering intake of arsenic beyond the primary source. Results from this analysis allowed a better understanding of the relationships between different sources of arsenic, some of which the researchers had not considered before, and human exposure.
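The mass balance principle can be sketched with a small calculation. The example below is a simplified illustration, not the team's model: all concentrations, intake rates, and the urine volume are made-up values, and it assumes complete absorption and excretion through urine only.

```python
def daily_intake_ug(sources):
    """Sum arsenic intake in ug/day over (concentration, daily amount) pairs."""
    return sum(conc * amount for conc, amount in sources.values())


def mass_balance_ratio(sources, urine_conc_ug_per_l, urine_volume_l_per_day):
    """Ratio of excreted to ingested arsenic; 1.0 means intake equals excretion."""
    excreted = urine_conc_ug_per_l * urine_volume_l_per_day
    return excreted / daily_intake_ug(sources)


# Hypothetical values for one person (assumptions, not study data).
water_only = {"water": (50.0, 2.0)}  # ug/L and L/day
all_sources = {
    "water": (50.0, 2.0),   # ug/L and L/day
    "dust": (200.0, 0.05),  # ug/g and g/day
    "air": (0.5, 20.0),     # ug/m3 and m3/day
}

ratio_water = mass_balance_ratio(water_only, 80.0, 1.5)
ratio_all = mass_balance_ratio(all_sources, 80.0, 1.5)
print(round(ratio_water, 2), round(ratio_all, 2))
```

In this toy case, counting water alone leaves more arsenic excreted than ingested, while adding the other sources brings the ratio toward 1, which mirrors the reasoning described above.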

Integrated Datasets, Portals/Dashboards, Tools, Code:

Analytical code and findings from analysis can be accessed via GitHub. Raw data cannot be shared due to privacy restrictions.

Data Harmonization Across SRP Pregnancy and Birth Cohorts

Collaborating Institutions:

UNM, Dartmouth College, and Northeastern University Puerto Rico Testsite for Exploring Contamination Threats (PROTECT) SRP Centers

Abstract:

Adverse pregnancy outcomes, like preterm birth and low birth weight, are a significant global public health challenge. Rates of adverse pregnancy outcomes are higher near hazardous waste sites or other sources of environmental pollution.

Collaborators from the UNM, Dartmouth College, and Northeastern University SRP Centers investigated whether their biomonitoring, demographic, and environmental data could be integrated across three populations to better understand the effect of environmental exposures on birth outcomes. The team wanted to create a data and methodology infrastructure that could serve as a foundation for current and future studies looking at common toxicants across populations, and at common outcomes of concern across contaminant classes and across populations. They aimed to integrate exposure metadata collected across the three cohorts and to develop a secure web platform to explore associations between exposure biomarkers and outcomes at birth, enabling more accurate predictions. Their approach was designed to determine the variance introduced in analyses by differences across populations, laboratory methods, classes of toxicants, and collection protocols. These are all critical to interpreting results and developing protocols to standardize the process.

Research Question:

Can biomonitoring data and other demographic and exposure data be integrated to study the effect of environmental exposures on multiple outcomes?

Data Sources and Data Sets:

	Northeastern University	Dartmouth College	University of New Mexico (METALS)
Cohort	PROTECT Cohort	New Hampshire Birth Cohort Study	Navajo Birth Cohort Study; Thinking Zinc Study
Community	Northern Puerto Rico	Rural New England	Navajo Nation (Indigenous)
Questionnaire Data	Demographics, socioeconomics, behavioral, medical history, diet, maternal stress	Demographics, socioeconomics, lifestyle, medical history, diet, supplement use, occupation, drinking water, and other exposures	Demographics, socioeconomics, diet; home construction; occupational information; activity and resource use; drinking water source; vitamin supplement use
Chemical exposure data	Phthalates	Arsenic and other nutrient and toxic elements	Uranium and mixed metals from mine waste
Biological and environmental samples	Urine, blood	Toenail clippings, urine, drinking water	Urine, blood, drinking water
Health outcome data	Gestational age, birth weight, birth length, head circumference, birth anomalies/defects, type of delivery, certain cytokines	Gestational age, birth weight, birth length, birth head circumference, birth anomalies/defects, type of delivery, Apgar scores, maternal/infant infections, labor course, certain cytokines	SRP outcomes on adults in the intervention include cytokines, lymphocyte profiles, DNA damage, antinuclear antibodies (ANA)
No. of participants	1,450+ pregnant women enrolled with 1,200+ live births to date	2,010 pregnant women with urinary metals assay results collected at ~24-28 weeks of gestation with 1,877 born as of date of the dataset compilation	780 pregnant women in birth cohort test case, cytokines, and other outcomes on ~200 to date, Zinc trial target 100
Data dictionary	PROTECT Data Dictionary	Dartmouth Data Dictionary	The database for the zinc study is still in development: birth cohort information is available through data managers directly

Metadata:

Vocabularies Utilized:

The team created their own expanded and harmonized data dictionary that allowed them to combine data across cohorts.

Approach:

To harmonize and integrate their data, the team first evaluated each cohort’s data dictionaries to map and align the common variables. Their expanded and harmonized data dictionary allowed them to combine data across cohorts.
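Mapping and aligning common variables can be sketched as a lookup from each cohort's local variable names to shared names. The cohort labels, local names, and harmonized names below are hypothetical placeholders, not entries from the PROTECT, Dartmouth, or UNM data dictionaries.

```python
# Hypothetical mapping from cohort-specific names to harmonized names.
VARIABLE_MAP = {
    "cohort_a": {"bw_grams": "birth_weight_g", "ga_wk": "gestational_age_weeks"},
    "cohort_b": {"birthweight": "birth_weight_g", "gest_age": "gestational_age_weeks"},
}


def harmonize(record, cohort):
    """Rename mapped variables, drop unmapped ones, and tag the source cohort."""
    mapping = VARIABLE_MAP[cohort]
    harmonized = {"cohort": cohort}
    for key, value in record.items():
        if key in mapping:
            harmonized[mapping[key]] = value
    return harmonized


def combine(cohort_records):
    """Harmonize every record from every cohort into one flat list."""
    return [
        harmonize(record, cohort)
        for cohort, records in cohort_records.items()
        for record in records
    ]


# Made-up records for illustration only.
combined = combine({
    "cohort_a": [{"bw_grams": 3200, "ga_wk": 39, "local_id": "x1"}],
    "cohort_b": [{"birthweight": 3400, "gest_age": 40}],
})
print(combined)
```

Tagging each record with its cohort keeps the option of analyzing a single cohort or all cohorts together after the data are combined.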

Given data privacy issues associated with data from the Navajo Nation METALS cohort, the team developed a secure data analysis framework to be hosted at UNM. They leveraged several open-source tools, including Django, a Python-based web framework; nginx, used as an application gateway; and Docker, used for containerization. This web-accessible secure processing platform can facilitate new and future scientific discoveries as well as potential cost savings.

Outcomes:

The resultant platform has been used to perform several statistical analyses, create graphics, and test project hypotheses. For example, the team investigated the association between arsenic exposure during pregnancy, measured as maternal urinary arsenic concentration, and birth outcomes, including gestational age, birth weight, and head circumference. Analysis can be performed within a single cohort or across cohorts. Preliminary analyses have revealed that higher arsenic exposure during pregnancy is associated with lower birth weight, and the team is beginning to characterize the variation in this relationship.

The team is continuing its analysis of the harmonized data sets and is working toward a joint paper on the arsenic study. The team is also interested in sharing its methodology and tools with other researchers who have significant privacy challenges associated with their cohorts.

Integrated Datasets, Portals/Dashboards, Tools, Code:

Analytical code and findings from analysis can be accessed via GitHub. Raw data cannot be shared due to privacy restrictions.

Arsenic Epigenetics META: Towards a Meta-Analysis of Epigenome Data on Arsenic

Collaborating Institutions:

UC Berkeley and Columbia University SRP Centers

Abstract:

A team of researchers from the UC Berkeley and Columbia University SRP Centers worked together to enable meta-analyses of multiple Epigenome-Wide Association Studies (EWAS) related to environmental arsenic exposures. Epigenetic changes alter gene expression without directly altering DNA sequences and might serve as biomarkers of environmental exposures. EWAS use genome-wide assays of epigenetic marks, such as DNA methylation, to identify associations between phenotypes or exposures and epigenetic variation across the genome. These studies provide unique insight into the role of the environment on human health, but most studies of arsenic exposure to date have worked with small sample sizes and utilize diverse data processing and analytical methods yielding different results.

The team aimed to pool EWAS studies in two different populations to determine if the influence of arsenic exposure on the epigenome is generalizable across study populations and tissues. Arsenic-related epigenetic dysregulation may provide information about biological pathways linking arsenic to health outcomes and provide a biomarker of previous exposure and disease risk.

Research Question:

Is the influence of arsenic exposure on the epigenome generalizable across study populations?

Data Sources and Data Sets:

Institution	Dataset Description
Columbia	Data from 80 participants from the Health Effects of Arsenic Longitudinal Study (HEALS) cohort study. Study participants were classified as having low arsenic (As) exposure or high As exposure based on drinking water As concentrations. Datasets include DNA methylation data (Illumina’s 450K array or the HumanMethylationEPIC BeadChip 850K), extensive biomarkers of As exposure (drinking water As, urinary As from multiple time points, blood As, As metabolites in blood and in urine), data on potential co-exposures, demographic information, and nutrition information.
UC Berkeley	Illumina HumanMethylationEPIC BeadChip array data from 40 participants from an adult cohort in Northern Chile, where half of the subjects had been exposed to very high levels of naturally occurring arsenic in drinking water as children. Hundreds of arsenic measurements in drinking water are available, covering more than the past 60 years. Data from buccal and blood cell samples were available.

Metadata:

Metadata Standards Utilized:

Vocabularies Utilized:

The team created their own expanded and harmonized data dictionary that allowed them to combine data across cohorts.

Approach:

The team established a consistent classification of exposure across datasets, along with quality control and data preprocessing steps, to facilitate integration. Preprocessing included data normalization, quality control, and cleaning. They standardized these steps across SRP centers by working collaboratively via GitHub.

Specifically, they created a workflow and protocol beginning with raw image data files obtained from the DNA methylation array technology. Each center performed data processing and conducted EWAS locally, and center-specific code was deposited to GitHub. The team leveraged R packages available through Bioconductor, a free, open-source, and open-development software project to facilitate reproducible research.

Outcomes:

In individual EWAS, the team did not find any common differentially methylated positions. Meta-analyzing their results increased statistical power to identify significant common findings. For example, their meta-analysis identified three differentially methylated positions and nineteen differentially variable positions. In KEGG biological pathway analysis, differentially methylated and variable positions in the genome were related to pathways with potential biological relevance to arsenic exposure, such as the one-carbon pool by folate pathway. One-carbon metabolism is responsible for synthesizing the methyl donor in arsenic metabolism, a process which facilitates urinary excretion and reduces arsenic toxicity.
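To illustrate why pooling results can increase statistical power, the sketch below applies a standard inverse-variance fixed-effect meta-analysis to a single methylation site. The source does not state which meta-analysis method the team used, so the method here is an assumption, and the effect estimates and standard errors are made-up numbers.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def fixed_effect_meta(estimates):
    """Pool (effect, standard error) pairs using inverse-variance weights."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se, two_sided_p(pooled / pooled_se)


# Hypothetical per-study results for one methylation site (not study data).
study_results = [(0.02, 0.01), (0.03, 0.02)]

pooled_beta, pooled_se, pooled_p = fixed_effect_meta(study_results)
single_p_values = [two_sided_p(beta / se) for beta, se in study_results]
print(pooled_beta, pooled_se, pooled_p, single_p_values)
```

With these illustrative inputs, the pooled standard error is smaller than either study's standard error, so the pooled p-value is smaller than the p-value from either study alone.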

Complete results from individual EWAS are available on GitHub and will be uploaded to an Open Science Framework (OSF) repository to facilitate comparison with other EWAS. Others can use the code for their own data, and the team’s written protocol can be used as a template for other collaborations.

The UC Berkeley and Columbia SRP Centers have also become closer collaborators over the course of the project. They held a virtual symposium to facilitate and build on their collaboration and plan to explore additional research questions, such as investigating chronic versus acute arsenic epigenetic signatures. They also plan to expand this project to other cohorts to reveal new insights.

Publications:

Bozack AK, Boileau P, Wei L, Hubbard AE, Sillé FCM, Ferreccio C, Acevedo J, Hou L, Ilievski V, Steinmaus CM, Smith MT, Navas-Acien A, Gamble MV, Cardenas A. 2021. Exposure to arsenic at different life-stages and DNA methylation meta-analysis in buccal cells and leukocytes. Environ Health 20(1):79.

Integrated Datasets, Portals/Dashboards, Tools, Code:

Analytical code and findings from analysis can be accessed via GitHub and OSF. Raw data cannot be shared due to privacy restrictions.

National Institute of Environmental Health Sciences
