Overview

Pan-ancestry GWAS of UK Biobank

Here, we present a multi-ancestry analysis of 7,228 phenotypes using a generalized mixed model association testing framework, spanning 16,131 genome-wide association studies. For each trait, we provide a standard meta-analysis across all populations as well as leave-one-population-out meta-analyses. We also develop a stringent quality control pipeline that identifies variants whose frequencies are discrepant with gnomAD, and we make recommendations for filtering these and other GWAS results.
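As a rough illustration of the two meta-analysis modes, the sketch below combines per-population effect estimates for a single variant. It assumes a fixed-effect, inverse-variance-weighted scheme, which is a common reading of "standard meta-analysis" but is not confirmed by this page. The function names and the input numbers are hypothetical and do not come from the release.

```python
import math

def ivw_meta(estimates):
    """Fixed-effect inverse-variance-weighted meta-analysis (assumed scheme).

    estimates maps a population code to a (beta, standard_error) pair.
    Returns the pooled (beta, standard_error).
    """
    weights = {pop: 1.0 / se ** 2 for pop, (_, se) in estimates.items()}
    total_weight = sum(weights.values())
    pooled_beta = sum(weights[pop] * estimates[pop][0] for pop in estimates) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se

def leave_one_population_out(estimates):
    """Pooled estimate with each population excluded in turn.

    Needs at least two populations; keys are the population left out.
    """
    return {
        left_out: ivw_meta({p: v for p, v in estimates.items() if p != left_out})
        for left_out in estimates
    }

# Made-up per-population estimates for one variant (illustrative only).
example = {"EUR": (0.10, 0.01), "CSA": (0.12, 0.05), "AFR": (0.08, 0.06)}
all_pops = ivw_meta(example)
without_each = leave_one_population_out(example)
```

Comparing `all_pops` with `without_each["EUR"]` shows how much a pooled estimate depends on the largest population, which is the usual motivation for a leave-one-out view.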

Multi-ancestry analysis

Participants have been divided into ancestry groups to account for population stratification in GWAS analyses. Throughout these docs, these ancestry groupings are referred to by 3-letter ancestry codes derived from or closely related to those used in the 1000 Genomes Project and Human Genome Diversity Panel, as follows:

EUR = European ancestry
CSA = Central/South Asian ancestry
AFR = African ancestry
EAS = East Asian ancestry
MID = Middle Eastern ancestry
AMR = Admixed American ancestry

These codes refer only to ancestry groupings used in GWAS, not necessarily other demographic or self-reported data.

Release data

We release the summary statistics in two formats:

For one or a few phenotypes, we recommend using the phenotype-specific flat files: see further description here.
For analysis of the full dataset (all phenotypes, all populations), the summary statistics are available in Hail formats: see further description here.

Approach

Analyses were run using SAIGE, implemented in Hail Batch to parallelize across populations, phenotypes, and regions of the genome. More details can be found below:

Details about the QC process, including the determination of ancestry groups, can be found here.
A description of the GWAS pipeline and its implementation can be found on our GitHub.

The sample size for each population and the number of phenotypes run are as follows:

Population	Num. Individuals	Num. Phenotypes
AFR	6636	2493
AMR	980	1105
CSA	8876	2771
EAS	2709	1612
EUR	420531	7200
MID	1599	1372

The number of samples run for a given phenotype may be lower than the population total, depending on data missingness. Per-phenotype sample counts can be found in the phenotype manifest, or in n_cases and n_controls in the Hail MatrixTable.
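To make the relationship between per-phenotype counts and the population totals concrete, here is a minimal sketch. The population sizes are copied from the table above. The helper names are hypothetical, and the handling of a missing n_controls value (treating n_cases as the full sample count, as one might for a quantitative trait) is an assumption rather than documented behavior.

```python
# Population sizes copied from the table above.
POPULATION_SIZES = {
    "AFR": 6636,
    "AMR": 980,
    "CSA": 8876,
    "EAS": 2709,
    "EUR": 420531,
    "MID": 1599,
}

def samples_analyzed(n_cases, n_controls=None):
    """Samples run for one phenotype in one population.

    Assumption: when n_controls is missing, n_cases alone is the sample count.
    """
    return n_cases + (n_controls or 0)

def fraction_of_population(pop, n_cases, n_controls=None):
    """Share of the population's individuals that were run for the phenotype."""
    return samples_analyzed(n_cases, n_controls) / POPULATION_SIZES[pop]

# Hypothetical counts for one phenotype in CSA (not taken from the manifest).
share = fraction_of_population("CSA", n_cases=400, n_controls=7600)
```

A low share flags a phenotype whose effective sample size is much smaller than the population total would suggest.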