
PEGS Data Freezes

Figure: Genotype-Phenotype-Environment representation of PEGS

The PEGS data are stored securely in a single centralized, shared repository to ensure consistent, reproducible, and comparable analyses. PEGS comprises a compatible, multi-dimensional collection of datasets in consistent, programmatically extractable formats, as shown in the figure and in the Data Components table below. The data are updated quarterly with additional participants, new variables, participant updates, and any additional data components. We are continually building analysis pipelines and workflows to enable efficient, reproducible, insightful, and collaborative research using the PEGS data.

Data Components

Data components available to researchers from the PEGS cohort are listed below with their descriptions and sample sizes (number of participants). The latest versions of the administered participant surveys are also provided.

| Category | Component | Description | Documents | Number of Participants |
| --- | --- | --- | --- | --- |
| Survey Data | Demographic and Administrative Data | Demographics, consent, address and administrative data for all participants | | 19,445 |
| | Health & Exposure Survey | Demographics, health, family history of disease, environmental exposures, socioeconomic status and lifestyle | Health & Exposure Survey (338KB) | 9,449 |
| | External Exposome Survey (Exposome A) | Residential and occupational environmental exposures | External Exposome Survey (27MB) | 3,618 |
| | Internal Exposome Survey (Exposome B) | Medication use, physical activity, stress, sleep, diet, genetics and reproductive history | Internal Exposome Survey (13MB) | 3,071 |
| | Diabetes Screener Survey | Diabetes screener administered to participants with self-reported diabetes | Diabetes Screener Survey (69KB) | 227 |
| | Eczema Screener Survey | Eczema screener administered to participants with self-reported eczema | Eczema Screener Survey (92KB) | 329 |
| | Right-not-to-know Main Survey | Right-not-to-know Survey administered for incidental findings reports | | 231 |
| | Right-not-to-know Cognitive Interview Survey | Right-not-to-know Cognitive Interview administered to assess awareness of incidental findings reports | Right-not-to-know Cognitive Interview Survey (1MB) | 12 |
| Medication Data | Anatomical Therapeutic Chemical (ATC) Codes | ATC codes for self-reported free-text medication names from the Internal Exposome Survey (Exposome B) as per the World Health Organization's (WHO's) ATC classification system | | 2,263 |
| Geospatial Data | Geocodes (GIS) | Geocoded participant addresses from five study events with mapping coordinates | | 18,462 |
| | Hazards Data | Exposure estimates and proximity measures calculated using geospatial linkages from the following databases: Atmospheric Composition Analysis Group (ACAG), Toxics Release Inventory (TRI), Center for Air, Climate, and Energy Solutions (CACES), North Carolina Department of Environmental Quality (NCDEQ), Department of Transportation (DOT), Federal Aviation Administration (FAA), Federal Communications Commission (FCC) and the Nuclear Regulatory Commission (NRC) | | 18,462 |
| | MERRA-2 Data (Earthdata) | Geospatial data linkages from the Modern Era Retrospective analysis for Research and Applications (MERRA-2) project containing consistent estimates of climate and environmental metrics from a range of satellite-based environmental observations | | 17,273 |
| | Social Vulnerability Index (SVI) Data | Geospatial data linkages for CDC/ATSDR Social Vulnerability Index containing summaries of social determinants of health at the census tract level | | 17,273 |
| | Environmental Justice Index (EJI) Data | Geospatial data linkages for CDC/ATSDR Environmental Justice Index containing summaries of environmental, social, and health factors at the census tract level | | 17,273 |
| Genomic Data | Candidate Gene/SNP Data | Candidate SNP data for a subset of participants for specific research goals | | 12,316 |
| | Single Nucleotide Variants (SNVs) | SNV and small indel genotypes derived from the whole-genome sequencing (WGS) data in PLINK .bed/.bim/.fam format | | 4,737 |
| | Structural Variants | Structural variant calls generated from the WGS data in .vcf format consisting of large deletions, duplications, and inversions | | 4,737 |
| | Human Leukocyte Antigens (HLA) Genotypes | HLA genotypes identified from the WGS data for 20 HLA genes with up to six digits of specificity | | 4,737 |
| | Telomeric Content | Aggregate telomeric content estimated from WGS reads reported as telomeric reads per GC content-matched million reads | | 4,737 |
| | Local and Global Ancestry Estimations | Inferred local ancestry per chromosome after haplotype phasing and global estimates of percent ancestry for each participant | | 4,730 |
| | Methylation Data | Genome-wide methylation profiling data using the Infinium MethylationEPIC v1.0 BeadChip Kit targeting 866,297 CpG sites | | 4,724 |
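
As an illustration of what "programmatically extractable" can look like for the SNV component, the text companions of a PLINK binary fileset can be inspected without special tooling. The sketch below assumes the standard PLINK 1 layout, in which .fam and .bim are whitespace-delimited text files with six columns each. The sample and variant identifiers are toy values invented for illustration (not PEGS data), the function and column names are hypothetical, and the binary .bed genotype file is not decoded.

```python
import io
from collections import Counter

# Standard six-column layouts of PLINK 1 .fam and .bim files.
# The column names are labels chosen for this sketch.
FAM_COLUMNS = ("family_id", "sample_id", "father_id", "mother_id", "sex", "phenotype")
BIM_COLUMNS = ("chrom", "variant_id", "cm_position", "bp_position", "allele1", "allele2")


def read_plink_table(handle, columns):
    """Parse a whitespace-delimited PLINK text file into a list of dicts."""
    rows = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        if len(fields) != len(columns):
            raise ValueError(
                f"line {line_number}: expected {len(columns)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(columns, fields)))
    return rows


# Toy stand-ins for real files (hypothetical IDs, not PEGS data).
fam_text = "F1 S1 0 0 2 -9\nF2 S2 0 0 1 -9\n"
bim_text = "1\tvar1\t0\t10583\tG\tA\n1\tvar2\t0\t10611\tC\tG\n2\tvar3\t0\t20000\tT\tC\n"

samples = read_plink_table(io.StringIO(fam_text), FAM_COLUMNS)
variants = read_plink_table(io.StringIO(bim_text), BIM_COLUMNS)
variants_per_chrom = Counter(v["chrom"] for v in variants)

print(len(samples), "samples;", len(variants), "variants")
print(dict(variants_per_chrom))
```

A check like this is mainly useful for confirming that the number of samples in a fileset matches the participant count reported for a component; an established genotype reader would be the better choice for the .bed data itself.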

Survey Summary

The Health & Exposure Survey covers the following categories of questions.

Health & Exposure Survey
About Your Family's Health
About Your General Health
About Your Home Life
About Your Mood
Bones, Joints, and Muscles
Cancer
Cardiovascular
Diabetes and Endocrine
Digestive
Exposures
Fatigue
Hematological
Immune
Lifestyle
Neurologic
Occupation
Renal
Reproductive (Females Only)
Reproductive (Males Only)
Respiratory
Skin, Eyes, and Hair

The External Exposome Survey (Exposome Survey - Part A) covers the following categories of questions.

External Exposome (Exposome A)
Characteristics of Current and Past Residences:
• Agricultural Property Use
• Garage and Basement
• Heating and Cooling
• Pesticides and Insecticides
• Pets
• Surrounding Area
• Walls and Flooring
• Water and Dampness
Chemical and Metal Exposures at Work
Hobby Exposures
Ultraviolet Light Exposures
Workplace Characteristics

The Internal Exposome Survey (Exposome Survey - Part B) covers the following categories of questions.

Internal Exposome (Exposome B)
Chemotherapy/Radiation Therapy
Dietary Behavior
Dietary Intake
Genetic History
Infectious Disease
Medications
Other
Physical Activity
Reproductive History (Females Only)
Sleep
Stress
Vitamins, Minerals, and Other Supplement Use
Twin/Triplet Siblings and Birth Order

Geospatial Data Summary

| Source | Description | Examples |
| --- | --- | --- |
| Geocodes (GIS) | Geocoded data from multiple participant-provided addresses: the address at initial enrollment, at completion of the Health & Exposure Survey, and at completion of the External Exposome Survey, plus the longest-lived childhood address and the longest-lived adult address from the External Exposome Survey. | Geographic coordinates (latitude and longitude) from multiple participant-provided addresses. |
| Hazards | Exposure estimates computed from Department of Transportation (DOT) data. | Information from train tracks, rail depots and roadways, such as total major roadway length, distance to nearest rail depot, etc. |
| Hazards | Exposure estimates computed from Federal Aviation Administration (FAA) data. | Information from aircraft departure and arrival sites, e.g., distance to nearest airport. |
| Hazards | Exposure estimates computed from Federal Communications Commission (FCC) data. | Information from cellular network towers, e.g., nearest cell tower. |
| Hazards | Exposure estimates computed from North Carolina Department of Environmental Quality (NCDEQ) data. | Distance to multi-pollutant point sources such as swine CAFOs, hazardous waste sites, hazardous spill sites, EPA superfund sites, wastewater treatment plant release sites, etc. |
| Hazards | Exposure estimates computed from Nuclear Regulatory Commission (NRC) data. | Distance to nuclear power station. |
| Hazards | Exposure estimates computed from Atmospheric Composition Analysis Group (ACAG) data. | Particulate matter concentrations: PM2.5 total, PM2.5 sulfate, PM2.5 black carbon, etc. |
| Hazards | Exposure estimates computed from Center for Air, Climate, and Energy Solutions (CACES) data. | Concentrations of multiple pollutants such as carbon monoxide, nitrogen dioxide, ozone, etc. |
| Hazards | Exposure estimates computed from Toxics Release Inventory (TRI) data. | Emissions for chemicals of interest such as benzene, ethylbenzene, xylene, toluene, etc. |
| MERRA-2 data (Earthdata) | Geospatial data linkages from the Modern Era Retrospective analysis for Research and Applications (MERRA-2) project, which assimilates a range of satellite-based environmental observations into consistent estimates of climate and environmental metrics. | Particulate, gas, meteorological, and health-relevant exposure indicators such as dust sedimentation, organic carbon emission bin, SO2 biomass burning emissions, sea-level pressure, etc. |
| Social Vulnerability Index (SVI) | Geospatial data linkages for CDC/ATSDR Social Vulnerability Index, designed to consistently quantify multiple social determinants of health across the United States over time. | Summaries of social determinants of health at the census tract level, including an overall index, four component indexes (socioeconomic status, household characteristics, racial and ethnic minority status, and housing type/transportation), and the source variables used to compute each index component (e.g., poverty, education, overcrowding, access to vehicle, etc.). |
| Environmental Justice Index (EJI) | Geospatial data linkages for CDC/ATSDR Environmental Justice Index, containing summaries and ranks of the cumulative impacts of environmental injustice on health at the census tract level. | Ranks for each census tract on 36 environmental, social, and health factors grouped into ten domains and three overarching modules: the environmental burden, social vulnerability and health vulnerability modules. |
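
Several of the hazard examples above are proximity measures, such as distance to the nearest airport or rail depot. The source does not state how these distances were computed, so the following is only a minimal sketch of one common approach: great-circle (haversine) distance between a geocoded address and a set of candidate sites, on a spherical-Earth assumption. The coordinates, site names, and function names are hypothetical and chosen purely for illustration.

```python
import math

# Mean Earth radius in kilometers; treating the Earth as a sphere is an
# assumption of this sketch, not a documented PEGS method.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_site(address, sites):
    """Return (site_name, distance_km) for the site closest to the address."""
    lat, lon = address
    return min(
        ((name, haversine_km(lat, lon, s_lat, s_lon)) for name, (s_lat, s_lon) in sites.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical coordinates, not real participant or facility locations.
address = (35.90, -78.90)
sites = {"site_a": (35.88, -78.79), "site_b": (36.10, -79.94)}

name, distance_km = nearest_site(address, sites)
print(f"nearest: {name} at {distance_km:.1f} km")
```

For large cohorts and many candidate sites, a spatial index or a projected coordinate system would likely be more appropriate than this brute-force comparison.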


All data on this website are reported from PEGS Data Freeze 3.1 created on 6/27/2023.