PEGS Data Freezes

Figure: Genotype-Phenotype-Environment representation of PEGS

The PEGS data are stored securely in a single centralized, shared repository to ensure consistent, reproducible, and comparable analyses. PEGS comprises a compatible, multi-dimensional collection of datasets in consistent, programmatically extractable formats, as shown in the accompanying figure and in Data Components below. PEGS data are updated quarterly with additional participants, new variables, participant updates, and any additional data components. We are continually building analysis pipelines and workflows to enable efficient, reproducible, insightful, and collaborative research using the PEGS data.

Data Components

Data components available to researchers from the PEGS cohort are listed with their descriptions and sample sizes (the number of participants). The latest versions of the administered participant surveys are also provided.

Category | Component | Description | Documents | Number of Participants
Survey Data | Demographic and Administrative Data | Demographics, consent, address, and administrative data for all participants | | 19,445
Survey Data | Health & Exposure Survey | Demographics, health, family history of disease, environmental exposures, socioeconomic status, and lifestyle | Health & Exposure Survey (338KB) | 9,449
Survey Data | External Exposome Survey (Exposome A) | Residential and occupational environmental exposures | External Exposome Survey (27MB) | 3,618
Survey Data | Internal Exposome Survey (Exposome B) | Medication use, physical activity, stress, sleep, diet, genetics, and reproductive history | Internal Exposome Survey (13MB) | 3,071
Survey Data | Diabetes Screener Survey | Diabetes screener administered to participants with self-reported diabetes | Diabetes Screener Survey (69KB) | 227
Survey Data | Eczema Screener Survey | Eczema screener administered to participants with self-reported eczema | Eczema Screener Survey (92KB) | 329
Survey Data | Right-not-to-know Main Survey | Right-not-to-know Survey administered for incidental findings reports | | 231
Survey Data | Right-not-to-know Cognitive Interview Survey | Right-not-to-know Cognitive Interview administered to assess awareness of incidental findings reports | Right-not-to-know Cognitive Interview Survey (1MB) | 12
Medication Data | Anatomical Therapeutic Chemical (ATC) Codes | ATC codes for self-reported free-text medication names from the Internal Exposome Survey (Exposome B), assigned per the World Health Organization's (WHO's) ATC classification system | | 2,263
Geospatial Data | Geocodes (GIS) | Geocoded participant addresses from five study events with mapping coordinates | | 18,462
Geospatial Data | Hazards Data | Exposure estimates and proximity measures calculated using geospatial linkages from the following databases: Atmospheric Composition Analysis Group (ACAG), Toxics Release Inventory (TRI), Center for Air, Climate, and Energy Solutions (CACES), North Carolina Department of Environmental Quality (NCDEQ), Department of Transportation (DOT), Federal Aviation Administration (FAA), Federal Communications Commission (FCC), and Nuclear Regulatory Commission (NRC) | | 18,462
Geospatial Data | MERRA-2 Data (Earthdata) | Geospatial data linkages from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) project containing consistent estimates of climate and environmental metrics from a range of satellite-based environmental observations | | 17,273
Geospatial Data | Social Vulnerability Index (SVI) Data | Geospatial data linkages for the CDC/ATSDR Social Vulnerability Index containing summaries of social determinants of health at the census tract level | | 17,273
Geospatial Data | Environmental Justice Index (EJI) Data | Geospatial data linkages for the CDC/ATSDR Environmental Justice Index containing summaries of environmental, social, and health factors at the census tract level | | 17,273
Genomic Data | Candidate Gene/SNP Data | Candidate SNP data for a subset of participants for specific research goals | | 12,316
Genomic Data | Single Nucleotide Variants (SNVs) | SNV and small indel genotypes derived from the whole-genome sequencing (WGS) data in PLINK .bed/.bim/.fam format | | 4,737
Genomic Data | Structural Variants | Structural variant calls generated from the WGS data in .vcf format consisting of large deletions, duplications, and inversions | | 4,737
Genomic Data | Human Leukocyte Antigens (HLA) Genotypes | HLA genotypes identified from the WGS data for 20 HLA genes with up to six digits of specificity | | 4,737
Genomic Data | Telomeric Content | Aggregate telomeric content estimated from WGS reads, reported as telomeric reads per GC content-matched million reads | | 4,737
Genomic Data | Local and Global Ancestry Estimations | Inferred local ancestry per chromosome after haplotype phasing and global estimates of percent ancestry for each participant | | 4,730
Genomic Data | Methylation Data | Genome-wide methylation profiling data using the Infinium MethylationEPIC v1.0 BeadChip Kit targeting 866,297 CpG sites | | 4,724
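
The Telomeric Content component is reported as telomeric reads per GC content-matched million reads. As a rough illustration of what that unit means, the normalization can be sketched as below. This is a minimal sketch, not the PEGS pipeline: the function name, argument names, and read counts are hypothetical, and the source does not state how telomeric reads or GC content-matched reads are identified.

```python
def telomeric_content(telomeric_reads: int, gc_matched_reads: int) -> float:
    """Return telomeric reads per million GC content-matched reads.

    Both counts are assumed to come from the same WGS sample. How each
    count is derived (repeat thresholds, GC window) is NOT specified in
    the source and is left to the caller.
    """
    if telomeric_reads < 0:
        raise ValueError("telomeric_reads must be non-negative")
    if gc_matched_reads <= 0:
        raise ValueError("gc_matched_reads must be positive")
    return telomeric_reads * 1_000_000 / gc_matched_reads


# Made-up counts, for illustration only (not PEGS data).
example_value = telomeric_content(telomeric_reads=1_500,
                                  gc_matched_reads=3_000_000)
print(example_value)
```

Normalizing by GC content-matched reads rather than by all reads is intended to make values more comparable across samples with different sequencing depth and GC bias.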

Survey Summary

Categories of survey questions administered to participants in the Health & Exposure Survey are listed below.

Health & Exposure Survey
• About Your Family's Health
• About Your General Health
• About Your Home Life
• About Your Mood
• Bones, Joints, and Muscles
• Cancer
• Cardiovascular
• Diabetes and Endocrine
• Digestive
• Exposures
• Fatigue
• Hematological
• Immune
• Lifestyle
• Neurologic
• Occupation
• Renal
• Reproductive (Females Only)
• Reproductive (Males Only)
• Respiratory
• Skin, Eyes, and Hair

Categories of survey questions administered to participants in the External Exposome Survey (Exposome Survey - Part A) are listed below.

External Exposome (Exposome A)
• Characteristics of Current and Past Residences:
    • Agricultural Property Use
    • Garage and Basement
    • Heating and Cooling
    • Pesticides and Insecticides
    • Pets
    • Surrounding Area
    • Walls and Flooring
    • Water and Dampness
• Chemical and Metal Exposures at Work
• Hobby Exposures
• Ultraviolet Light Exposures
• Workplace Characteristics

Categories of survey questions administered to participants in the Internal Exposome Survey (Exposome Survey - Part B) are listed below.

Internal Exposome (Exposome B)
• Chemotherapy/Radiation Therapy
• Dietary Behavior
• Dietary Intake
• Genetic History
• Infectious Disease
• Medications
• Other
• Physical Activity
• Reproductive History (Females Only)
• Sleep
• Stress
• Vitamins, Minerals, and Other Supplement Use
• Twin/Triplet Siblings and Birth Order
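
The free-text medication names collected in this survey feed the Medication Data component, which assigns WHO ATC codes. A full ATC code has seven characters arranged in five hierarchical levels, so a researcher may want to roll codes up to a coarser level before analysis. The sketch below shows one way to do that. It assumes full seven-character codes; the source does not say whether PEGS stores full or partial codes, and the names `ATCLevels` and `split_atc` are hypothetical.

```python
import re
from typing import NamedTuple


class ATCLevels(NamedTuple):
    anatomical_main_group: str     # level 1, e.g. "A"
    therapeutic_subgroup: str      # level 2, e.g. "A10"
    pharmacological_subgroup: str  # level 3, e.g. "A10B"
    chemical_subgroup: str         # level 4, e.g. "A10BA"
    chemical_substance: str        # level 5, e.g. "A10BA02"


# Shape check only: letter, two digits, two letters, two digits.
# It does not verify that the code exists in the WHO ATC index.
_ATC_PATTERN = re.compile(r"[A-Z]\d{2}[A-Z]{2}\d{2}")


def split_atc(code: str) -> ATCLevels:
    """Split a full seven-character ATC code into its five levels."""
    normalized = code.strip().upper()
    if not _ATC_PATTERN.fullmatch(normalized):
        raise ValueError(f"not a full seven-character ATC code: {code!r}")
    return ATCLevels(
        normalized[:1],
        normalized[:3],
        normalized[:4],
        normalized[:5],
        normalized[:7],
    )


# A10BA02 is the ATC code for metformin.
levels = split_atc("A10BA02")
print(levels.therapeutic_subgroup)  # A10 (drugs used in diabetes)
```

Grouping on `therapeutic_subgroup` or `pharmacological_subgroup` instead of the full code trades substance-level detail for larger per-group sample sizes.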

Geospatial Data Summary

Source | Description | Examples
Geocodes (GIS) | Geocoded participant-provided addresses from five time points: initial enrollment, completion of the Health & Exposure Survey, completion of the External Exposome Survey, and the longest-lived childhood and longest-lived adult addresses reported in the External Exposome Survey. | Geographic coordinates (latitude and longitude) from multiple participant-provided addresses.
Hazards | Exposure estimates computed from Department of Transportation (DOT) data. | Information from train tracks, rail depots, and roadways, such as total major roadway length, distance to nearest rail depot, etc.
Hazards | Exposure estimates computed from Federal Aviation Administration (FAA) data. | Information from aircraft departure and arrival sites, e.g., distance to nearest airport.
Hazards | Exposure estimates computed from Federal Communications Commission (FCC) data. | Information from cellular network towers, e.g., nearest cell tower.
Hazards | Exposure estimates computed from North Carolina Department of Environmental Quality (NCDEQ) data. | Distance to multi-pollutant point sources such as swine CAFOs, hazardous waste sites, hazardous spill sites, EPA Superfund sites, wastewater treatment plant release sites, etc.
Hazards | Exposure estimates computed from Nuclear Regulatory Commission (NRC) data. | Distance to nuclear power station.
Hazards | Exposure estimates computed from Atmospheric Composition Analysis Group (ACAG) data. | Particulate matter concentrations: PM2.5 total, PM2.5 sulfate, PM2.5 black carbon, etc.
Hazards | Exposure estimates computed from Center for Air, Climate, and Energy Solutions (CACES) data. | Concentrations of multiple pollutants such as carbon monoxide, nitrogen dioxide, ozone, etc.
Hazards | Exposure estimates computed from Toxics Release Inventory (TRI) data. | Emissions of chemicals of interest such as benzene, ethylbenzene, xylene, toluene, etc.
MERRA-2 data (Earthdata) | Geospatial data linkages from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) project, which assimilates a range of satellite-based environmental observations into a consistent estimate of climate and environmental metrics. | Particulate, gas, meteorological, and health-relevant exposure indicators such as dust sedimentation, organic carbon emission bin, SO2 biomass burning emissions, sea-level pressure, etc.
Social Vulnerability Index (SVI) | Geospatial data linkages for the CDC/ATSDR Social Vulnerability Index, designed to consistently quantify multiple social determinants of health across the United States over time. | Summaries of social determinants of health at the census tract level, including an overall index, four component indexes (socioeconomic status, household characteristics, racial and ethnic minority status, and housing type/transportation), and the source variables used to compute each index component (e.g., poverty, education, overcrowding, access to vehicle, etc.).
Environmental Justice Index (EJI) | Geospatial data linkages for the CDC/ATSDR Environmental Justice Index, containing summaries and ranks of the cumulative impacts of environmental injustice on health at the census tract level. | Ranks for each census tract on 36 environmental, social, and health factors grouped into ten domains and three overarching modules: environmental burden, social vulnerability, and health vulnerability.
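
Several Hazards examples are proximity measures, such as distance to the nearest airport or rail depot. The source does not describe how PEGS computes these distances, so the following is only an illustrative sketch: it uses the haversine great-circle formula on latitude/longitude pairs, and the function names and coordinates are hypothetical placeholders, not participant or site data.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

LatLon = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_site_km(address: LatLon, sites: Iterable[LatLon]) -> float:
    """Distance from a geocoded address to the closest site in `sites`."""
    distances = [haversine_km(address[0], address[1], lat, lon)
                 for lat, lon in sites]
    if not distances:
        raise ValueError("sites must contain at least one location")
    return min(distances)


# Placeholder coordinates, for illustration only.
address = (0.0, 0.0)
sites = [(0.0, 1.0), (0.0, 2.0)]
print(round(nearest_site_km(address, sites), 2))
```

A linear scan like this is adequate for a handful of sites; for large site inventories a spatial index or a projected coordinate system would be the more usual choice.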


All data on this website are reported from PEGS Data Freeze 3.1 created on 6/27/2023.