First release

June 15, 2020 · 4 min read

Pan UKBB Team

We are thrilled to announce the release of GWAS summary statistics from the Pan-UK Biobank resource, which consists of genome-wide association analyses of 7,221 phenotypes across 6 continental ancestry groups in the UK Biobank. Across all phenotype-ancestry pairs, we conducted 16,131 GWAS and meta-analyzed summary statistics for all available populations by trait. This release includes more than 20,000 individuals with primarily non-European ancestries, substantially increasing the diversity typically investigated in analyses of these data.

Sample sizes and phenotype counts for each ancestry group in this release are:

Population	Sample size	Total phenos	Categorical	Continuous	Phecode	ICD-10	Biomarkers	Prescriptions
AFR	6636	2493	981	337	197	725	30	223
AMR	980	1105	423	31	20	561	30	40
CSA	8876	2771	1051	418	234	719	30	319
EAS	2709	1612	618	91	55	714	29	105
EUR	420531	7200	3672	1325	929	800	30	444
MID	1599	1372	509	83	52	591	30	107
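As a quick sanity check on the table above, the following sketch re-enters its numbers by hand and confirms two things: that the non-European groups together exceed 20,000 individuals, and that the six phenotype categories in each row add up to that row's total. The dictionary layout and variable names are ours for illustration and are not part of the released manifest.

```python
# Numbers copied from the table above:
# population -> (sample size, total phenotypes, [categorical, continuous,
#                phecode, ICD-10, biomarkers, prescriptions])
rows = {
    "AFR": (6636, 2493, [981, 337, 197, 725, 30, 223]),
    "AMR": (980, 1105, [423, 31, 20, 561, 30, 40]),
    "CSA": (8876, 2771, [1051, 418, 234, 719, 30, 319]),
    "EAS": (2709, 1612, [618, 91, 55, 714, 29, 105]),
    "EUR": (420531, 7200, [3672, 1325, 929, 800, 30, 444]),
    "MID": (1599, 1372, [509, 83, 52, 591, 30, 107]),
}

# Individuals in the five non-European ancestry groups.
non_eur_samples = sum(n for pop, (n, _, _) in rows.items() if pop != "EUR")

# Individuals across all six groups.
total_samples = sum(n for n, _, _ in rows.values())

# Populations whose category counts do not add up to the stated total.
mismatches = [pop for pop, (_, total, cats) in rows.items() if sum(cats) != total]

print(f"Non-European samples: {non_eur_samples}")
print(f"All samples: {total_samples}")
print(f"Rows with inconsistent totals: {mismatches}")
```

The non-European groups sum to 20,800 individuals, and every row's category counts match its total.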

Rapidly developing this resource required a massively scalable computational framework, and we are thankful to the Hail team for building and supporting the system that enabled this work. We developed a pipeline for these analyses using Hail Batch and provide our analytical code for reference to the community. We ran this pipeline in the Batch Service, a multi-tenant compute cluster in Google Cloud managed by the Hail team, which, at the time of use, allowed up to 100,000 CPUs to be used simultaneously. In total, the association tests required 3.8 million CPU-hours, which Hail Batch completed in approximately 6 days of wall-clock time.
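To put these figures in perspective, here is a back-of-the-envelope calculation using only the numbers quoted above. It assumes, purely for illustration, that the work was spread evenly over the 6 days; actual usage on a multi-tenant cluster would have varied over time.

```python
# Figures quoted in the text above.
cpu_hours = 3.8e6        # total CPU-hours for the association tests
wall_clock_days = 6      # approximate wall-clock duration
max_cpus = 100_000       # cluster's simultaneous CPU limit at the time

wall_clock_hours = wall_clock_days * 24

# Average number of CPUs busy at once, assuming evenly spread usage
# (an illustrative assumption, not a measured value).
avg_cpus = cpu_hours / wall_clock_hours

# Shortest possible run if all 100,000 CPUs had been used the whole time.
min_hours_at_cap = cpu_hours / max_cpus

# Time the same work would take on a single CPU, in years.
single_cpu_years = cpu_hours / (24 * 365)

print(f"Average concurrent CPUs: {avg_cpus:,.0f}")
print(f"Minimum wall-clock hours at the CPU cap: {min_hours_at_cap:.0f}")
print(f"Equivalent single-CPU time: {single_cpu_years:,.0f} years")
```

Under that assumption the run averaged roughly 26,000 CPUs at a time, about a quarter of the cluster's limit, and the same work on a single CPU would have taken more than 400 years.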

We felt obligated to make this resource publicly available as soon as it was stable given its potential benefits to society, for example through contribution to pressing activities such as the COVID-19 Host Genetics Initiative. We will be further developing and analyzing this resource, with some additions forthcoming, as follows:

We will add features to the website, including searchable and interactive summary statistics data as well as multi-ancestry visualizations of association results
We will continually analyze and release updated COVID-19 phenotypes for all ancestry groups as they become available
We are committed to computing LD scores in the UK Biobank for all ancestry groups analyzed in this resource and will provide these when they have been rigorously tested
We will calculate heritabilities using LD score regression for each ancestry group and phenotype
We will keep the phenotype manifest and release files updated with all further analyses

Diverse genetic studies are critical to equitably advance scientific discoveries and applications. However, both history and current events teach us that some will unintentionally or willfully misuse this resource to advance racist agendas. The harm done to certain groups in the past in the name of science underscores the importance of carefully communicating scientific research and its implications, and of the intense vigilance required to ensure that disadvantaged groups are not further harmed by this and other related work. We have adopted several strategies in an effort to maximize benefits and minimize risks, most notably the development of a set of FAQs designed to guide proper use and interpretation of these results.

Analyzing a more inclusive and diverse dataset increases power for discovery, enhances the resolution of the resulting findings, and improves the generalizability of genetic associations across populations. We hope that this research will encourage future studies to make use of data that are traditionally left out of analyses and to develop more diverse resources, so that genetics can deliver on its mission of improving healthcare for all.
