Phase of compound heterozygotes in gnomAD
Information for v4: Implementation is currently in progress.
Information for v2: This repository contains the pipeline used to infer the phase of rare variants in the gnomAD v2 exomes, as reported in our corresponding manuscript (see https://www.nature.com/articles/s41588-023-01608-3). The pipeline is written in Hail 0.2.
The main components of the pipeline can be found in "phasing.py". Please note that the most up-to-date phasing algorithm is called from "compute_gnomad_phasing.py". Briefly, to infer variant phase, we generate haplotype frequency estimates from genotype counts by applying the expectation-maximization (EM) algorithm (see the "get_em_expressions" function, which calls "hl.experimental.haplotype_freq_em") and then calculate the probability that two variants are in trans, i.e. compound heterozygous ("p_chet").
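The idea behind this step can be illustrated with a small, dependency-free sketch. This is not the code in "phasing.py" and it does not call Hail. The function names "estimate_haplotype_freqs" and "prob_trans", the 3x3 layout of the genotype counts, the uniform starting frequencies, and the stopping rule are all assumptions made for illustration. The probability of being in trans is computed here as the product of the two single-alternate haplotype frequencies divided by the sum of the cis and trans products, which is our reading of the approach and should be checked against the manuscript.

```python
def estimate_haplotype_freqs(gt_counts, max_iter=100, tol=1e-10):
    """Estimate two-site haplotype frequencies with EM (illustrative sketch).

    gt_counts[i][j] is the number of individuals carrying i alternate alleles
    at variant 1 and j alternate alleles at variant 2 (i, j in 0..2).
    Returns frequencies for haplotypes AB, Ab, aB and ab, where upper case
    is the reference allele and lower case is the alternate allele.
    """
    n = gt_counts
    n_haplotypes = 2 * sum(sum(row) for row in n)
    if n_haplotypes == 0:
        raise ValueError("gt_counts contains no individuals")

    # Haplotype counts that follow unambiguously from the genotypes.
    known = {
        "AB": 2 * n[0][0] + n[0][1] + n[1][0],
        "Ab": 2 * n[0][2] + n[0][1] + n[1][2],
        "aB": 2 * n[2][0] + n[1][0] + n[2][1],
        "ab": 2 * n[2][2] + n[1][2] + n[2][1],
    }
    # Double heterozygotes are phase-ambiguous: AB/ab (cis) or Ab/aB (trans).
    n_double_het = n[1][1]

    freqs = {h: 0.25 for h in known}  # uniform start (an assumption)
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is in cis.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5

        # M-step: split the double heterozygotes and re-estimate frequencies.
        expected = {
            "AB": known["AB"] + n_double_het * p_cis,
            "ab": known["ab"] + n_double_het * p_cis,
            "Ab": known["Ab"] + n_double_het * (1 - p_cis),
            "aB": known["aB"] + n_double_het * (1 - p_cis),
        }
        new_freqs = {h: c / n_haplotypes for h, c in expected.items()}
        converged = max(abs(new_freqs[h] - freqs[h]) for h in freqs) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


def prob_trans(freqs):
    """Probability that the two alternate alleles sit on different haplotypes."""
    cis = freqs["AB"] * freqs["ab"]
    trans = freqs["Ab"] * freqs["aB"]
    if cis + trans == 0:
        return None  # undefined when neither configuration is supported
    return trans / (cis + trans)


# Example: each variant is seen only in individuals lacking the other one.
example_counts = [[90, 5, 0], [5, 0, 0], [0, 0, 0]]
example_freqs = estimate_haplotype_freqs(example_counts)
print(example_freqs, prob_trans(example_freqs))
```

In the example, no individual carries both variants, so no haplotype with both alternate alleles is inferred and the pair is scored as in trans. When the two variants are only ever seen together, the estimate moves toward cis instead.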
To run the pipeline, use the "compute_gnomad_phase.py" script.
The remaining scripts in the repository compute the phase of rare variant pairs specifically in the gnomAD and Center for Mendelian Genetics rare disease datasets, and generate the gnomAD variant co-occurrence look-up tool (see https://gnomad.broadinstitute.org/variant-cooccurrence) and the variant co-occurrence counts by gene resource (see https://gnomad.broadinstitute.org/news/2023-03-variant-co-occurrence-counts-by-gene-in-gnomad/). These scripts cannot be run outside the gnomAD team, as they require access to individual-level data, and are provided for reference only.