Hail Format
# Release files

The results of this analysis are released in two main files on Google Cloud Storage (file format compatible with Hail >= 0.2.42):

- Summary statistics MatrixTable: gs://ukb-diverse-pops-public/sumstats_release/results_full.mt (12.78 T)
- Meta-analysis MatrixTable: gs://ukb-diverse-pops-public/sumstats_release/meta_analysis.mt (12.54 T)
These are also available on Amazon S3:
- Summary statistics MatrixTable: s3://pan-ukb-us-east-1/sumstats_release/results_full.mt (12.78 T)
- Meta-analysis MatrixTable: s3://pan-ukb-us-east-1/sumstats_release/meta_analysis.mt (12.54 T)
In addition, in-sample full LD matrices and scores are available on Amazon S3:
- LD BlockMatrix: s3://pan-ukb-us-east-1/ld_release/UKBB.{pop}.ldadj.bm (43.3 T in total)
  - Size by population: AFR: 12.0 T, AMR: 3.3 T, CSA: 6.4 T, EAS: 2.6 T, EUR: 14.1 T, MID: 4.9 T
- Variant index Hail Table: s3://pan-ukb-us-east-1/ld_release/UKBB.{pop}.ldadj.variant.ht (1.7 G in total)
- LD score Hail Table: s3://pan-ukb-us-east-1/ld_release/UKBB.{pop}.ldscore.ht (4.0 G in total)

where {pop} represents one of the population abbreviations (i.e., AFR, AMR, CSA, EAS, EUR, or MID).
# Requester pays

Note that the files in the Google Cloud Storage bucket are "requester pays": in order to compute over these files or download them, you will need to specify a project which may be billed for access and download costs. The data are stored in a US multi-region bucket; thus, access to the dataset is free for Compute Engine instances started within US regions, as well as for full downloads within the US and Canada. When performing large analyses on the dataset, we suggest "bringing the compute to the data" and starting a VM or Dataproc cluster in a US region. You can browse the directory structure in a requester-pays bucket with the -u flag (and note the hl.init call below to access the data using Hail):
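As a sketch of the setup (the billing project name is a placeholder, and the exact Spark/GCS-connector property names can vary by connector version — treat these keys as an assumption to verify against your deployment):

```python
# Browse the bucket first with gsutil's -u flag, e.g.:
#   gsutil -u your-billing-project ls gs://ukb-diverse-pops-public/sumstats_release/
import hail as hl

hl.init(
    spark_conf={
        # GCS connector settings enabling requester-pays reads;
        # 'your-billing-project' is a placeholder for a project you can bill.
        'spark.hadoop.fs.gs.requester.pays.mode': 'AUTO',
        'spark.hadoop.fs.gs.requester.pays.project.id': 'your-billing-project',
    }
)
```

After this, gs:// paths in the bucket can be read with the usual hl.read_matrix_table / hl.read_table calls.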
# Using the libraries and files

The files on Google Cloud Platform can be accessed by cloning the ukbb_pan_ancestry and ukb_common repos and using them programmatically. We recommend using these functions, as they apply our QC metrics (e.g., the raw file contains 7,271 phenotypes, but this function returns 7,221 phenotypes after removing low-quality ones) and include convenience metrics such as lambda GC.
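A minimal sketch of loading the data this way (assuming both repos are cloned onto your PYTHONPATH and that load_final_sumstats_mt is importable from the package top level — verify the exact import path against the repo):

```python
# Assumes the ukbb_pan_ancestry and ukb_common repos are available locally.
from ukbb_pan_ancestry import load_final_sumstats_mt

mt = load_final_sumstats_mt()  # applies phenotype and variant QC by default
mt.describe()                  # print the column, row, and entry schema
```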
# Results schema

The basic summary statistics have the following schema:
# Columns (phenotypes)

The columns are indexed by phenotype using a composite key of trait_type, phenocode, pheno_sex, coding, and modifier. Trait types take one of the values below. phenocode typically corresponds to the Field from UK Biobank, the specific ICD code or phecode, or a custom moniker. pheno_sex designates which sexes were run, and is marked as both_sexes for most traits, though some phecodes were restricted to females or males. The coding field is primarily used for categorical variables, to indicate which one-hot encoding was used (e.g., coding 2 for field 1747). Finally, modifier refers to any downstream modifications of the phenotype (e.g., irnt for inverse-rank normal transformation).
By default, the MatrixTable loaded by load_final_sumstats_mt returns one column per phenotype-population pair. We can see the number of unique phenotypes for each trait_type by:
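One hedged way to do this count (assuming mt was loaded with load_final_sumstats_mt as above; note that distinct phenocode values approximate unique phenotypes and may slightly undercount keys that differ only in coding or modifier):

```python
import hail as hl

# mt: summary statistics MatrixTable from load_final_sumstats_mt()
cols = mt.cols()
cols.group_by(cols.trait_type).aggregate(
    n_phenos=hl.agg.count_distinct(cols.phenocode)  # unique phenocodes per trait_type
).show()
```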
You can explore the population-level data in more detail using:
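For example, a sketch of a per-population tally (pheno_data.pop as a field name is an assumption based on the pheno_data struct described below; check mt.describe() for the actual layout):

```python
import hail as hl

# Count phenotype columns per population in the default one-column-per-pair layout.
print(mt.aggregate_cols(hl.agg.counter(mt.pheno_data.pop)))
```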
More information about the GWAS run is found in the pheno_data struct. By default, when loading using load_final_sumstats_mt, the best-practice QC parameters are used, which remove traits with a lambda GC < 0.5 or > 2. If this is undesirable, use load_final_sumstats_mt(filter_phenos=False).
# Rows (variants)

The rows are indexed by locus and alleles. Direct annotations can be found in the vep schema, but we also provide a nearest_genes annotation for ease of analysis. Additionally, variant QC annotations are provided in the high_quality field (which is filtered on by default in load_final_sumstats_mt and can be switched off via the filter_variants parameter of that function).
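A brief sketch of working with these row fields (assuming mt from load_final_sumstats_mt, possibly loaded with filter_variants=False so that high_quality filtering is still meaningful):

```python
# Restrict to high-quality variants and peek at the nearest_genes annotation.
mt_hq = mt.filter_rows(mt.high_quality)
mt_hq.rows().select('nearest_genes').show(5)
```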
# Entries (association tests)

The entry fields house the summary statistics themselves. Note that there is a low_confidence annotation that flags a possibly low-quality association test (allele count in cases or controls <= 3, or overall minor allele count < 20).
The resulting dataset can be filtered and annotated as a standard Hail MatrixTable:
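For instance (a sketch, assuming mt from load_final_sumstats_mt; the low_confidence and high_quality fields are described above, while any other entry-field names you use should be checked with mt.describe()):

```python
# Drop low-confidence association tests and restrict to high-quality variants,
# using ordinary MatrixTable operations.
mt_flt = mt.filter_entries(~mt.low_confidence)
mt_flt = mt_flt.filter_rows(mt_flt.high_quality)
# Any standard annotation also applies, e.g.:
# mt_flt = mt_flt.annotate_entries(significant=mt_flt.Pvalue < 5e-8)
# (Pvalue as an entry-field name is an assumption, not confirmed here.)
```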
# Meta-analysis files

The meta-analysis results are in a similarly structured file:
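The file can be read directly (path from the release list above; requires the requester-pays setup on GCS, or use the S3 copy):

```python
import hail as hl

meta_mt = hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/meta_analysis.mt')
meta_mt.describe()  # inspect the meta_analysis entry schema
```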
Here, the results are provided in an array: the 0th element (meta_mt.meta_analysis[0]) is the meta-analysis across all available populations, and the remaining elements are leave-one-out meta-analyses.
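For example, to work with just the all-population result (a sketch, assuming meta_mt is the meta-analysis MatrixTable read from the path above):

```python
# Lift the all-available-population meta-analysis (element 0) into its own entry field.
meta_all = meta_mt.annotate_entries(meta_all_pop=meta_mt.meta_analysis[0])
```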
# Combining the datasets

We also provide a function to annotate the overall sumstats MatrixTable with the largest meta-analysis for that phenotype.
If your analysis requires the simultaneous analysis of summary statistics from multiple populations (and not the meta-analysis), you can load the data with a structure similar to the meta-analysis MatrixTable (one column per phenotype, with population information packed into arrays of entries and column fields) using load_final_sumstats_mt(separate_columns_by_pop=False).
# LD matrices

The LD matrices are in BlockMatrix format. Please refer to Hail's documentation for available operations on BlockMatrix.
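Reading one population's matrix looks like this (a sketch: EUR is substituted for {pop}, and depending on your Spark/Hadoop configuration the S3 path may need the s3a:// scheme and credentials configured):

```python
from hail.linalg import BlockMatrix

# Path from the release list above, with {pop} = EUR.
bm = BlockMatrix.read('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.bm')
print(bm.shape)  # (n_variants, n_variants)
```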
We note that the LD matrices were sparsified to an upper triangle (all elements of the lower triangle were zeroed out using BlockMatrix.sparsify_triangle).
# Variant indices

To determine which row/column corresponds to which variant, we provide variant indices for each BlockMatrix in Hail Table format.
The variant index table has the following schema; idx corresponds to a row/column index in the BlockMatrix.
# Extracting a subset of the LD matrix

To extract a subset of the LD matrix, you first need to identify the indices of your variants of interest. Here, we provide two examples:
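Two hedged sketches of the index lookup (paths from the release list with {pop} = EUR; the interval and "other_ht" below are hypothetical placeholders, and the table is assumed to be keyed by locus and alleles):

```python
import hail as hl

ht_idx = hl.read_table('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')

# Example 1: all variants within a genomic interval (interval is a placeholder).
ht_sub = hl.filter_intervals(ht_idx, [hl.parse_locus_interval('1:51572000-51572500')])

# Example 2: variants shared with another Table keyed the same way
# (other_ht is a hypothetical Table of your variants of interest).
# ht_sub = ht_idx.semi_join(other_ht)

idx = ht_sub.idx.collect()  # row/column indices into the BlockMatrix
```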
Then, you can filter the LD matrix to this subset using BlockMatrix.filter:
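A self-contained sketch of the whole subsetting step (EUR substituted for {pop}; the interval is a placeholder):

```python
import hail as hl
from hail.linalg import BlockMatrix

bm = BlockMatrix.read('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.bm')
ht_idx = hl.read_table('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')

# Indices of the variants of interest (placeholder interval).
ht_sub = hl.filter_intervals(ht_idx, [hl.parse_locus_interval('1:51572000-51572500')])
idx = ht_sub.idx.collect()

bm_sub = bm.filter(idx, idx)  # keep only these rows and columns
```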
# Exporting an LD matrix to a flat file

Finally, to export an LD matrix to a flat (text) file, you can use BlockMatrix.export:
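A sketch of the export (bm_sub is a previously filtered BlockMatrix, and the output bucket paths are hypothetical placeholders; BlockMatrix.export operates on a written BlockMatrix path):

```python
from hail.linalg import BlockMatrix

# Write the (subset) matrix first, row-major, so it can be exported.
bm_sub.write('gs://your-bucket/ld_sub.bm', force_row_major=True)

# Export the stored matrix to a block-gzipped TSV.
BlockMatrix.export('gs://your-bucket/ld_sub.bm',
                   'gs://your-bucket/ld_sub.tsv.bgz',
                   delimiter='\t')
```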
If your matrix is small enough to fit in memory, you can also export it directly to a NumPy array via BlockMatrix.to_numpy.
# LD scores

The LD scores are in Hail Table format. For LDSC-compatible flat files, you can find them here.
The LD score table has the following schema.
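To inspect the schema yourself (path from the release list above, with {pop} = EUR substituted):

```python
import hail as hl

ht_ldscore = hl.read_table('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldscore.ht')
ht_ldscore.describe()  # print the LD score table schema
```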