Per-phenotype files
Overview
The data are released in 7,221 flat files, one for each phenotype, each with a corresponding tabix index file. These files are available on Amazon AWS (for large-scale analysis, we recommend using the Hail format files on Google Cloud).
The files are named with respect to their trait_type, phenocode, and a combination of pheno_sex, coding, or modifier. To find a specific phenotype, we suggest looking in the phenotype manifest (Google Sheets) (available for download on Amazon). Search for your phenotype(s) of interest and use the paths indicated to download the summary statistics. A description of the fields in the manifest can be found in the Phenotype manifest file section below.
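The lookup described above can be sketched in a few lines of Python. This is only an illustration: it assumes the manifest has been saved locally as a tab-separated file, and the sample rows and file names are invented placeholders. Only the column names are taken from the field descriptions on this page.

```python
import csv
import io

# Hypothetical excerpt of the phenotype manifest (tab-separated).
# Column names follow this page; the values are invented placeholders.
SAMPLE_MANIFEST = (
    "trait_type\tphenocode\tpheno_sex\tcoding\tmodifier\tfilename\n"
    "continuous\t21001\tboth_sexes\t\tirnt\texample_bmi.tsv.bgz\n"
    "categorical\t1747\tboth_sexes\t2\t\texample_hair.tsv.bgz\n"
)


def find_phenotypes(manifest_file, **criteria):
    """Return manifest rows whose fields equal every given criterion."""
    reader = csv.DictReader(manifest_file, delimiter="\t")
    return [
        row
        for row in reader
        if all(row.get(field) == value for field, value in criteria.items())
    ]


matches = find_phenotypes(
    io.StringIO(SAMPLE_MANIFEST), trait_type="continuous", phenocode="21001"
)
for row in matches:
    print(row["filename"])
```

With a real manifest, the same function could be given an open file handle instead of the in-memory sample, and the aws_link or wget column of each matching row could be used for the download.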
The per-phenotype files are summary statistics files containing meta-analyzed and single-ancestry GWAS results. We especially highlight the low_confidence fields, which include some basic (non-exhaustive) quality control filters (see below). These files each have 28,987,534 variants, but note that not all populations will have data for each variant.
Finally, the variant manifest file includes information on each variant in the dataset and has the same number of rows as each per-phenotype file. We highlight the high_quality column, which marks variants that are PASS variants in gnomAD and have frequencies consistent with each population in gnomAD (AFR, AMR, EAS, and EUR frequencies are within 2-fold or chi-squared p-value of the difference > 1e-6).
Phenotype manifest file
Pan-UK Biobank phenotype manifest (Google Sheets) (download on Amazon)
Phenotype ID fields
The combination of the first 5 fields is guaranteed to be unique.
trait_type
: One of the following: continuous, biomarkers, prescriptions, icd10, phecode, categorical
phenocode
: The code for the phenotype (for continuous, biomarkers, and categorical traits, this corresponds to the field ID as described by UKB, e.g. 21001 for BMI)
pheno_sex
: Indicates whether the phenotype was run for both sexes (pheno_sex="both_sexes"), in just females (pheno_sex="females"), or in just males (pheno_sex="males"). In 0.1, this is only differentiated for phecodes.
coding
: For categorical variables, this corresponds to the coding that was used (e.g. coding 2 for field 1747). For all other trait types, this field is blank.
modifier
: Refers to any miscellaneous downstream modifications of the phenotype (e.g. irnt for inverse-rank normal transformation). If the phenotype is updated, this field can be used to denote the update (e.g. the particular wave of COVID-19 data used).
description
: A shorter description of the phenotype (for continuous, biomarkers, and categorical variables, corresponds to the Description on the showcase). For phecodes, this is the "description" column in the phecodes definition file.
description_more
: A longer description of the phenotype (for continuous and categorical variables, corresponds to the Notes page on the showcase).
coding_description
: For categorical variables, a description of the particular coding that was used (the Meaning column on the showcase page for that coding).
category
: A categorization of the phenotype. For continuous, biomarkers, and categorical traits, this corresponds to the Category at the top of the showcase page. For ICD codes, this corresponds to the Chapter of the ICD code; for phecodes, this is the "group" column in the phecodes definition file; for prescriptions, this corresponds to a semi-manual categorization of prescription drugs.
Case and ancestry fields
Note: If a trait is quantitative (trait_type is "continuous" or "biomarkers"), all samples are considered to be "cases". Thus, the number of cases is equivalent to the number of samples.
n_cases_full_cohort_both_sexes
: Number of cases (or individuals phenotyped for quantitative traits) across all ancestry groups, females and males combined. May include ancestry outliers and samples that failed QC.
n_cases_full_cohort_females
: Number of female cases (or individuals phenotyped for quantitative traits) across all ancestry groups. May include ancestry outliers and samples that failed QC.
n_cases_full_cohort_males
: Number of male cases (or individuals phenotyped for quantitative traits) across all ancestry groups. May include ancestry outliers and samples that failed QC.
pops
: Comma-delimited list of ancestry codes for which this phenotype was GWASed.
num_pops
: Number of ancestry groups for which this phenotype was GWASed.
Population-specific fields
Note: The variable pop is a placeholder for a 3-letter ancestry code. For example, n_cases_AFR is the number of cases with AFR ancestry.
Note: If a trait is quantitative (trait_type is "continuous" or "biomarkers"), all samples are considered to be "cases". Thus, the number of cases is equivalent to the number of samples.
n_cases_{pop}
: Number of cases (or individuals phenotyped for quantitative traits) with pop ancestry in the GWAS analysis. Excludes ancestry outliers and samples that failed QC.
n_controls_{pop}
: Number of controls with pop ancestry in the GWAS analysis. Excludes ancestry outliers and samples that failed QC.
saige_heritability_{pop}
: The heritability as estimated by SAIGE: note that this is likely not well-calibrated for binary traits, or traits with high heritabilities. A second estimate of heritability from LD score regression is coming soon.
lambda_gc_{pop}
: The genomic control (lambda GC) calculated from the summary statistics for pop with low-confidence statistics removed and only considering high-quality variants.
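For readers who want to reproduce a value like lambda_gc_{pop}, the sketch below uses the common definition of lambda GC: the median observed chi-squared statistic divided by the median of a chi-squared distribution with one degree of freedom. This page does not spell out the exact procedure, so treat the formula, the function name, and the example rows as assumptions. Note also that high_quality is a column of the variant manifest, so in practice it would have to be joined onto the per-phenotype statistics first.

```python
import statistics

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def lambda_gc(betas, ses):
    """Median observed chi-squared statistic over its expected median."""
    chi2 = [(beta / se) ** 2 for beta, se in zip(betas, ses)]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical per-variant statistics for one population (values invented).
variants = [
    {"beta": 0.02, "se": 0.03, "low_confidence": False, "high_quality": True},
    {"beta": -0.05, "se": 0.04, "low_confidence": False, "high_quality": True},
    {"beta": 0.01, "se": 0.05, "low_confidence": False, "high_quality": True},
    {"beta": 0.90, "se": 0.10, "low_confidence": True, "high_quality": True},
    {"beta": 0.40, "se": 0.10, "low_confidence": False, "high_quality": False},
]

# Keep only high-quality variants whose statistics are not low confidence.
kept = [v for v in variants if v["high_quality"] and not v["low_confidence"]]
print(lambda_gc([v["beta"] for v in kept], [v["se"] for v in kept]))
```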
File information
Note: For each field in this section there also exists a field with the suffix _tabix, which contains the equivalent information for the tabix file. For instance, filename_tabix contains the name of the tabix file.
filename
: Name of summary statistics file.
aws_link
: Link to download summary statistics file from Amazon AWS.
wget
: wget command to download summary statistics file.
size_in_bytes
: Size of summary statistics file in bytes.
md5_hex
: MD5 hexadecimal hash.
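The size_in_bytes and md5_hex fields can be used to check that a download completed correctly. The following is a minimal sketch using only the Python standard library. The function names are illustrative and not part of any Pan-UKB tooling.

```python
import hashlib
import os


def md5_hex_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, size_in_bytes, md5_hex):
    """Compare a local file against the manifest's size and MD5 fields."""
    # The cheap size check runs first, so truncated files are caught early.
    if os.path.getsize(path) != int(size_in_bytes):
        return False
    return md5_hex_of(path) == md5_hex.strip().lower()
```

For example, download_is_intact("file.tsv.bgz", row["size_in_bytes"], row["md5_hex"]) would return True only if both values match, where row is a manifest row read as a dictionary.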
Per-phenotype files
The per-phenotype files are tsv.bgz files, which are (b)gzipped: they can either be unzipped (zcat file.tsv.bgz > file.txt), or read natively in R (read_delim(gzfile('file.tsv.bgz'), delim='\t')) and Python (gzip.open('file.tsv.bgz')).
Depending on whether a phenotype is quantitative (trait_type is "continuous" or "biomarkers") or binary (trait_type is "prescriptions", "icd10", "phecode" or "categorical"), the number of columns differs, because files for binary phenotypes contain case/control-stratified statistics.
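As a small illustration of reading these files in Python, the sketch below streams rows with the standard gzip and csv modules and guesses from the header whether a file belongs to a binary phenotype. The helper names are hypothetical. The check relies only on the statement above that case/control-stratified columns appear for binary phenotypes.

```python
import csv
import gzip


def iter_rows(path):
    """Yield each variant in a .tsv.bgz file as a dict keyed by column name."""
    # bgzip output is valid gzip, so the standard gzip module can read it.
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def is_binary_phenotype(columns):
    """Binary phenotypes have case-stratified columns such as af_cases_meta."""
    return any(name.startswith("af_cases_") for name in columns)
```

Because iter_rows is a generator, the full file (nearly 29 million variants) is never held in memory at once. All values are returned as strings and need converting before numeric use.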
Variant fields
chr
: Chromosome of the variant.
pos
: Position of the variant in GRCh37 coordinates.
ref
: Reference allele on the forward strand.
alt
: Alternate allele (not necessarily minor allele). Used as effect allele for GWAS.
Meta-analysis fields
af_meta
: Alternate allele frequency from meta-analysis across populations for which this phenotype was GWASed. NOTE: This field only appears in files for quantitative phenotypes.
af_cases_meta
: Alternate allele frequency in cases from meta-analysis across populations for which this phenotype was GWASed. NOTE: This field only appears in files for binary phenotypes.
af_controls_meta
: Alternate allele frequency in controls from meta-analysis across populations for which this phenotype was GWASed. NOTE: This field only appears in files for binary phenotypes.
beta_meta
: Estimated effect size of alternate allele from meta-analysis across populations for which this phenotype was GWASed.
se_meta
: Estimated standard error of beta_meta.
pval_meta
: p-value of beta_meta significance test.
pval_heterogeneity
: p-value from heterogeneity test of meta-analysis.
Population-specific fields
Note: The variable pop used in this section is a placeholder for a 3-letter ancestry code. For example, af_AFR is the alternate allele frequency for AFR samples included in the GWAS of this phenotype.
Note: An ancestry-specific column is only included in the file if a GWAS was run for that ancestry. For example, a trait that was only GWASed in AMR and CSA samples will only have the fields af_AMR, af_CSA, beta_AMR, beta_CSA, etc.
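Since the set of ancestry-specific columns varies by phenotype, it can be useful to recover the populations from a file's header. A minimal sketch, assuming every GWASed population contributes a beta_{pop} column as described in this section (the function name is hypothetical):

```python
def populations_in_file(columns):
    """List ancestry codes that have a beta_{pop} column in the header."""
    prefix = "beta_"
    suffixes = [name[len(prefix):] for name in columns if name.startswith(prefix)]
    # beta_meta is the meta-analysis column, not an ancestry group.
    return [suffix for suffix in suffixes if suffix != "meta"]


header = ["chr", "pos", "ref", "alt", "af_meta", "beta_meta", "se_meta",
          "af_AMR", "af_CSA", "beta_AMR", "beta_CSA"]
print(populations_in_file(header))
```

The result could be cross-checked against the pops field of the phenotype manifest for the same phenotype.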
af_{pop}
: Alternate allele frequency for pop samples included in the GWAS of this phenotype. NOTE: This field only appears in files for quantitative phenotypes.
af_cases_{pop}
: Alternate allele frequency for pop cases included in the GWAS of this phenotype. NOTE: This field only appears in files for binary phenotypes.
af_controls_{pop}
: Alternate allele frequency for pop controls included in the GWAS of this phenotype. NOTE: This field only appears in files for binary phenotypes.
beta_{pop}
: Estimated effect size of alternate allele from GWAS of pop samples.
se_{pop}
: Estimated standard error of beta_{pop}.
pval_{pop}
: p-value of beta_{pop} significance test.
low_confidence_{pop}
: Boolean flag indicating low confidence for pop based on the following heuristics:
- Alternate allele count in cases <= 3
- Alternate allele count in controls <= 3
- Minor allele count (cases and controls combined) <= 20
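The three heuristics above can be written as a small predicate. This sketch makes two assumptions that the text does not state explicitly: a variant is flagged if any one of the conditions holds, and the control check is skipped for quantitative traits, where all samples count as cases. The function and parameter names are illustrative.

```python
def is_low_confidence(alt_count_cases, alt_count_controls, allele_number):
    """Apply the low-confidence heuristics to one variant in one population.

    alt_count_controls may be None for quantitative traits (assumption).
    allele_number is the total allele count, cases and controls combined.
    """
    controls = alt_count_controls if alt_count_controls is not None else 0
    alt_count = alt_count_cases + controls
    # The minor allele is whichever of alt or ref is rarer.
    minor_count = min(alt_count, allele_number - alt_count)
    if alt_count_cases <= 3:
        return True
    if alt_count_controls is not None and alt_count_controls <= 3:
        return True
    return minor_count <= 20


print(is_low_confidence(3, 100, 2000))   # few alternate alleles in cases
print(is_low_confidence(50, 100, 2000))  # passes all three checks
```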
Variant manifest file
Variant manifest (download from Amazon AWS)
Variant fields
The first four fields (chr, pos, ref, alt) are as in the per-phenotype files.
chr
: Chromosome of the variant.
pos
: Position of the variant in GRCh37 coordinates.
ref
: Reference allele on the forward strand.
alt
: Alternate allele (not necessarily minor allele). Used as effect allele for GWAS.
rsid
: The RSID for the variant (from the BGEN file from UK Biobank).
varid
: The variant ID for the variant (from the BGEN file from UK Biobank).
pass_gnomad_genomes
: A boolean corresponding to the PASS status in gnomAD (NA if variant is not in gnomAD).
n_passing_populations
: The number of populations (max 4: AFR, AMR, EAS, EUR) where the frequency in UKB is less than twice the frequency in gnomAD for the corresponding population (see below), and the p-value of a chi-squared test assessing the difference is > 1e-6.
high_quality
: A boolean corresponding to a high-quality variant based on these filters (pass_gnomad_genomes & n_passing_populations == 4).
nearest_genes
: The nearest genes for this variant based on Gencode v19.
info
: The Info score for this variant (from the ukb_mfi_chrN_v2.txt files).
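The high_quality definition above translates directly into code. In this sketch, a missing gnomAD status (NA) is represented as None and treated as not passing, which is an assumption about how NA combines with the filter. The function name is illustrative.

```python
def is_high_quality(pass_gnomad_genomes, n_passing_populations):
    """pass_gnomad_genomes & n_passing_populations == 4, with NA as None."""
    # A variant absent from gnomAD (None) is treated as not high quality.
    return pass_gnomad_genomes is True and n_passing_populations == 4


print(is_high_quality(True, 4))
print(is_high_quality(True, 3))
print(is_high_quality(None, 4))
```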
Population-specific fields
Note: The variable pop used in this section is a placeholder for a 3-letter ancestry code. For example, af_AFR is the alternate allele frequency for AFR individuals in the whole dataset.
ac_{pop}
: The alternate allele count for this variant across all individuals in pop. Defined as af_{pop} * an_{pop}.
af_{pop}
: The alternate allele frequency for this variant across all individuals in pop. Defined as the mean dosage divided by two.
an_{pop}
: The total allele number for this variant. This is twice the number of pop individuals with a defined genotype at this site.
gnomad_genomes_ac_{pop}
: The alternate allele count for this variant in the nearest gnomAD population: AFR, EAS, and AMR are matched as-is, while EUR is matched to the "North-West European" subset of gnomAD (not available for CSA or MID, as these populations are not in gnomAD v2 genomes).
gnomad_genomes_af_{pop}
: The alternate allele frequency for this variant in the nearest gnomAD population.
gnomad_genomes_an_{pop}
: The total allele number for this variant in the nearest gnomAD population. This is twice the number of individuals with a defined genotype at this site.
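The relationships among ac_{pop}, af_{pop}, and an_{pop} can be illustrated with a short worked example. The dosages below are invented, and representing an undefined genotype as None is an assumption made for the sketch.

```python
def allele_stats(dosages):
    """Return (ac, af, an) from per-individual alternate-allele dosages.

    Each dosage lies between 0 and 2; None marks an undefined genotype.
    """
    defined = [d for d in dosages if d is not None]
    an = 2 * len(defined)                  # two alleles per genotyped individual
    mean_dosage = sum(defined) / len(defined)
    af = mean_dosage / 2                   # mean dosage divided by two
    ac = af * an                           # alternate allele count
    return ac, af, an


ac, af, an = allele_stats([0, 1, 2, None, 0.5])
print(ac, af, an)
```

Because dosages from imputed data are fractional, ac need not be an integer.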