Overview
# Pan-ancestry GWAS of UK Biobank

Here, we present a multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. We provide standard meta-analyses across all populations, as well as leave-one-population-out meta-analyses for each trait. We develop a stringent quality control pipeline that identifies variants whose frequencies are discrepant with gnomAD, and we make recommendations for filtering these and other GWAS results.
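
To make the meta-analysis and leave-one-population-out ideas concrete, here is a minimal sketch. It assumes a fixed-effects, inverse-variance-weighted meta-analysis, which is one common reading of "standard meta-analysis" and not necessarily the exact method used in this release. The function names (`ivw_meta`, `leave_one_out`) and the effect sizes are hypothetical and for illustration only.

```python
import math


def ivw_meta(estimates):
    """Fixed-effects inverse-variance-weighted meta-analysis (assumed method).

    estimates: dict mapping population code -> (beta, standard_error).
    Returns (pooled_beta, pooled_se).
    """
    if not estimates:
        raise ValueError("need at least one population to meta-analyze")
    # Each population is weighted by the inverse of its sampling variance.
    weights = {pop: 1.0 / se ** 2 for pop, (_, se) in estimates.items()}
    total_weight = sum(weights.values())
    pooled_beta = sum(weights[pop] * estimates[pop][0] for pop in estimates) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


def leave_one_out(estimates):
    """Repeat the meta-analysis with each population excluded in turn."""
    if len(estimates) < 2:
        raise ValueError("leave-one-out needs at least two populations")
    return {
        left_out: ivw_meta({p: v for p, v in estimates.items() if p != left_out})
        for left_out in estimates
    }


# Hypothetical per-population effect sizes for one variant and one trait.
example = {
    "EUR": (0.10, 0.01),
    "CSA": (0.12, 0.05),
    "AFR": (0.08, 0.06),
}

pooled_beta, pooled_se = ivw_meta(example)
loo = leave_one_out(example)

print(f"all populations: beta={pooled_beta:.4f}, se={pooled_se:.4f}")
for left_out, (beta, se) in loo.items():
    print(f"without {left_out}: beta={beta:.4f}, se={se:.4f}")
```

Comparing the pooled estimate with each leave-one-out estimate shows how strongly a single population (typically the largest) drives the combined result.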
# Multi-ancestry analysis

Participants have been divided into ancestry groups to account for population stratification in GWAS analyses. Throughout these docs, these ancestry groupings are referred to by 3-letter ancestry codes derived from or closely related to those used in the 1000 Genomes Project and Human Genome Diversity Panel, as follows:
- EUR = European ancestry
- CSA = Central/South Asian ancestry
- AFR = African ancestry
- EAS = East Asian ancestry
- MID = Middle Eastern ancestry
- AMR = Admixed American ancestry

These codes refer only to ancestry groupings used in GWAS, not necessarily other demographic or self-reported data.
# Release data

We release the summary statistics in two formats:
- For one or a few phenotypes, we recommend using the phenotype-specific flat files: see further description here.
- For analysis of the full dataset (all phenotypes, all populations), the summary statistics are available in Hail formats: see further description here.
# Approach

Analysis was done using SAIGE, implemented in Hail Batch to parallelize across populations, phenotypes, and regions of the genome. More details can be found below:
- Details about the QC process, including determination of ancestry groups, can be found here.
- A description of the GWAS pipeline and its implementation can be found on our GitHub.
The sample size for each population and the number of phenotypes run are as follows:

Population | Num. Individuals | Num. Phenotypes |
---|---|---|
AFR | 6636 | 2493 |
AMR | 980 | 1105 |
CSA | 8876 | 2771 |
EAS | 2709 | 1612 |
EUR | 420531 | 7200 |
MID | 1599 | 1372 |

Each phenotype may have been run on fewer samples, depending on data missingness. Per-phenotype sample sizes can be found in the phenotype manifest, or in `n_cases` and `n_controls` in the Hail MatrixTable.
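
As a quick way to see how uneven the sample sizes are, the sketch below copies the numbers from the table above into a plain Python dictionary and computes each population's share of the total. The dictionary layout and variable names are hypothetical; they are not part of the release files or the Hail schema.

```python
# Sample sizes and phenotype counts copied from the table above.
# The dictionary layout is an illustrative assumption, not a release format.
populations = {
    "AFR": {"label": "African", "n_individuals": 6636, "n_phenotypes": 2493},
    "AMR": {"label": "Admixed American", "n_individuals": 980, "n_phenotypes": 1105},
    "CSA": {"label": "Central/South Asian", "n_individuals": 8876, "n_phenotypes": 2771},
    "EAS": {"label": "East Asian", "n_individuals": 2709, "n_phenotypes": 1612},
    "EUR": {"label": "European", "n_individuals": 420531, "n_phenotypes": 7200},
    "MID": {"label": "Middle Eastern", "n_individuals": 1599, "n_phenotypes": 1372},
}

total_individuals = sum(info["n_individuals"] for info in populations.values())

# List populations from largest to smallest, with their share of all samples.
ranked = sorted(populations.items(), key=lambda kv: kv[1]["n_individuals"], reverse=True)
for code, info in ranked:
    share = info["n_individuals"] / total_individuals
    print(
        f"{code} ({info['label']} ancestry): "
        f"{info['n_individuals']:>7,} individuals ({share:.2%}), "
        f"{info['n_phenotypes']:,} phenotypes"
    )
```

These are upper bounds per population; as noted above, the number of samples actually used for a given phenotype can be lower.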