Three Open Questions in Polygenic Score Portability

Provided below are instructions and details for scripts used to generate the results and figures in "Three Open Questions in Polygenic Score Portability".

For questions regarding the scripts, please contact Joyce Wang at [email protected].

Installation

Install the following software:

  1. Plink 1.9 (Purcell, S. & Chang, C.)
  2. Plink 2.0 (Purcell, S. & Chang, C.)
  3. regenie (Mbatchou et al.)
  4. PRS-CS (Ge et al.)

Download the UK Biobank (UKB) dataset, following their guidelines. The scripts also use the 1000 Genomes phase 3 dataset provided by Plink, but you do not need to download it beforehand: 05h_ukb_kgp_pca.sh includes the commands to download it.

For running the scripts, we recommend creating a conda environment.

git clone https://github.com/harpak-lab/Portability_Questions
cd Portability_Questions
conda env create -f environment.yml

Modification

Copy the scripts you need from their folder to the root directory of the cloned repo. Scripts can depend on each other, so we recommend copying all the scripts in a folder together. For example, to copy all the scripts under Prepare_the_data, use:

cd Prepare_the_data
cp * ../
cd ../

Before execution, change the directory paths hard-coded in the scripts so that they point to your own directories. The #SBATCH -A OTH21148 line must also be updated to match your allocation.
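As a minimal sketch of the account update, the snippet below rewrites the #SBATCH -A line in every .sh file in a directory. The allocation name MY_ALLOCATION is a placeholder, and the throwaway demo script exists only so the example runs on its own; in practice you would point the loop at the root of the cloned repo. Directory paths are left out because they differ per script and are best reviewed by hand.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder allocation name (an assumption); replace with your own.
NEW_ACCOUNT="MY_ALLOCATION"

# Build a throwaway script so this example is self-contained.
demo_dir="$(mktemp -d)"
printf '#!/bin/bash\n#SBATCH -A OTH21148\n#SBATCH -N 1\necho run\n' \
  > "$demo_dir/example_job.sh"

# Rewrite the account line in every .sh file; -i.bak keeps a backup copy
# and works with both GNU and BSD sed.
for script in "$demo_dir"/*.sh; do
  sed -i.bak "s/^#SBATCH -A .*/#SBATCH -A ${NEW_ACCOUNT}/" "$script"
done

# Show the updated line for a quick visual check.
grep -n '^#SBATCH -A' "$demo_dir/example_job.sh"
```

The .bak files can be deleted once the updated scripts look right.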

Execution

Submit each bash script (files ending in .sh) with sbatch <script_name.sh>, or consult your computing cluster's documentation on how to submit jobs.

Please see the details for each script in the following sections:

Preparing the data

Execute the following files to filter and prepare the data:

  1. 00_make_directories.sh
  2. 01a_extract_data_fields.sh (make sure to edit the file so it's pointing to the correct UKB basket file)
  3. 01b_filter_individuals_job.sh
  4. 01d_filter_genotype_files.sh
  5. 02_prepare_covariates_phenotypes.sh
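If you would rather not submit each step by hand, one possible approach is to chain the jobs with Slurm dependencies so that each script starts only after the previous one finishes successfully. The sketch below is not part of the repo. It assumes the five scripts above can run strictly in sequence, and the DRY_RUN switch and the JOBID_OF_ placeholder are made up for illustration. If a script launches further jobs of its own, afterok only tracks the job that was submitted directly.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Script names come from the list above; running them as one chain is an assumption.
scripts=(
  00_make_directories.sh
  01a_extract_data_fields.sh
  01b_filter_individuals_job.sh
  01d_filter_genotype_files.sh
  02_prepare_covariates_phenotypes.sh
)

# DRY_RUN=1 (the default here) only prints the commands.
# Set DRY_RUN=0 on a Slurm cluster to actually submit.
DRY_RUN="${DRY_RUN:-1}"
prev_job=""
planned=()

for script in "${scripts[@]}"; do
  cmd=(sbatch --parsable)
  if [[ -n "$prev_job" ]]; then
    # Start only after the previous job exits with status 0.
    cmd+=("--dependency=afterok:${prev_job}")
  fi
  cmd+=("$script")
  planned+=("${cmd[*]}")

  if [[ "$DRY_RUN" == "1" ]]; then
    echo "${cmd[*]}"
    prev_job="JOBID_OF_${script}"   # placeholder, not a real job ID
  else
    prev_job="$("${cmd[@]}")"
    prev_job="${prev_job%%;*}"      # --parsable may print "jobid;cluster"
  fi
done
```

Run it once in dry-run mode to check the order before submitting anything.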

GWAS

In the selection of the GWAS sample, we used the White British classification as provided by the UKB.

Execute the following files to perform GWAS, clumping, and thresholding:

  1. 03_gwas.sh
  2. 04a_clumping.sh
  3. 04e_after_clumping.sh

Genetic distance calculations

The fixation index (Fst) is a natural single-number metric of the divergence between two sets of chromosomes, and we considered using it to measure the distance between an individual's pair of chromosomes and the chromosomes in the GWAS sample. However, calculating Fst was computationally costly, so we instead used Euclidean distance in PC space as a single-number proxy for genetic distance from the GWAS sample.

Execute the following files to calculate Fst:

  1. 05a_pc_dist_fst.sh
  2. All the scripts created under temp_fst_path

Then, execute the following files to calculate Euclidean distance:

  1. 05e_find_best_num_pc.sh (creates Supplementary Fig. 1)
  2. 05h_ukb_kgp_pca.sh (downloads 1000 Genomes phase 3 dataset provided by Plink)
  3. 05j_pc_dist_fst_plots.sh (creates Fig. 1)
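As a rough illustration of Euclidean distance in PC space, the sketch below computes each individual's distance from the centroid of the GWAS sample. Using the centroid as the reference point, the toy input layout (ID, GWAS-membership flag, then PCs), and the values themselves are assumptions for illustration. The actual calculation is in the scripts listed above and may select or weight PCs differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy input (made-up values): ID, in_gwas flag (1/0), PC1, PC2.
pcs="$(mktemp)"
cat > "$pcs" <<'EOF'
id1 1 0.0 0.0
id2 1 2.0 0.0
id3 0 4.0 4.0
EOF

# The file is read twice: once to get the centroid, once to get distances.
dist="$(awk '
  # First pass (NR == FNR): accumulate PC sums over GWAS individuals only.
  NR == FNR {
    if ($2 == 1) { n++; for (i = 3; i <= NF; i++) sum[i] += $i }
    next
  }
  # Second pass: Euclidean distance of every individual from the GWAS centroid.
  {
    d = 0
    for (i = 3; i <= NF; i++) { diff = $i - sum[i] / n; d += diff * diff }
    printf "%s %.4f\n", $1, sqrt(d)
  }
' "$pcs" "$pcs")"

echo "$dist"
```

With these toy values the GWAS centroid is (1, 0), so id3 at (4, 4) lies at distance 5.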

PGS and evaluating PGS prediction accuracy

Execute the following file to calculate PGS:

  1. 06_compute_prs.sh

We evaluated PGS prediction accuracy at both the group level and individual level:

  1. 07_group_ind_level_pred.sh (creates Fig. 2; Supplementary Figs. 2-13)

We compared how much of the variance in squared prediction error is explained by each of 8 raw measures: genetic distance; Townsend Deprivation Index; average yearly total household income before tax; educational attainment (converted into years of education); minor allele counts for SNPs in each of three equally sized bins of small, medium, and large squared effect sizes (see Supplementary Fig. 23); and minor allele counts of all SNPs:

  1. 08a_prepare_for_ma_counts.sh
  2. 08b_calc_ma_counts.sh
  3. 08d_ind_pred_plots.sh (creates Fig. 3; Supplementary Figs. 14-21)
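To make the three equally sized effect-size bins concrete, the sketch below ranks SNPs by squared effect size and splits them into thirds. The SNP IDs, effect sizes, and two-column input layout are made up for illustration, and the rank-based split is one plausible reading of "equally sized" rather than the repo's confirmed method.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # keep decimal points and sort order predictable

# Toy input (made-up values): SNP ID and estimated effect size.
effects="$(mktemp)"
cat > "$effects" <<'EOF'
rs1 0.10
rs2 -0.30
rs3 0.05
rs4 0.20
rs5 -0.02
rs6 0.40
EOF

# Square the effects, sort ascending, then label each third by rank.
bins="$(awk '{ printf "%s %.6f\n", $1, $2 * $2 }' "$effects" \
  | sort -k2,2n \
  | awk '
      { snp[NR] = $1; sq[NR] = $2 }
      END {
        for (i = 1; i <= NR; i++) {
          b = int((i - 1) * 3 / NR)   # 0, 1, or 2
          label = (b == 0) ? "small" : ((b == 1) ? "medium" : "large")
          print snp[i], sq[i], label
        }
      }')"

echo "$bins"
```

Minor allele counts per individual could then be summed separately over the SNPs in each bin.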

Additional analyses on lymphocyte count

To understand why group-level prediction accuracy for immunity-related traits like lymphocyte count drops to near zero even at short genetic distances, we performed additional analyses.

We first performed two additional GWASs and compared the allelic effects across the three GWASs:

  1. 09a_prepare_close_far_pca.sh
  2. 09c_close_far_pca.sh
  3. 09d_prepare_close_far_gwas.sh
  4. 09f_gwas_close.sh and 09f_gwas_far.sh

We calculated heterozygosity at index SNPs as a function of genetic distance:

  1. 09g_calc_heterozygosity.sh
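For intuition, heterozygosity at a single SNP can be summarized per genetic-distance bin as the expected value 2p(1 - p), where p is the allele frequency in that bin. The sketch below assumes this expected-heterozygosity definition, along with a toy input of bin labels and 0/1/2 genotype counts. The script above may instead use observed heterozygosity or a different binning.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy input (made-up values): distance bin and alternate-allele count (0, 1, or 2)
# for one SNP, one row per individual.
geno="$(mktemp)"
cat > "$geno" <<'EOF'
near 0
near 1
near 1
near 2
far 0
far 0
far 1
far 1
EOF

# Per bin: allele frequency p = alt alleles / (2 * individuals), then 2p(1 - p).
het="$(awk '
  { n[$1]++; alt[$1] += $2 }
  END {
    for (b in n) {
      p = alt[b] / (2 * n[b])
      printf "%s %.3f\n", b, 2 * p * (1 - p)
    }
  }
' "$geno")"

echo "$het"
```

Here the near bin has p = 0.5 and the far bin has p = 0.25, giving expected heterozygosities of 0.5 and 0.375.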

We examined the variance of PGS as a function of genetic distance:

  1. 09j_compare_effect_sizes_heterozygosity_var_pgs_plots.sh (creates Fig. 4; Supplementary Fig. 22)

We estimated the heritability associated with each index SNP:

  1. 10a_compare_heritability.sh (creates Supplementary Figs. 23-24)

Portability of disease traits

We estimated the PGS portability of 3 disease traits at the group level.

For these disease traits, we ran GWAS, clumped and thresholded the SNPs, and calculated PGS:

  1. 11a_gwas_disease.sh
  2. 11c_clumping_disease.sh
  3. 11f_after_clumping_disease.sh
  4. 11g_compute_pgs_disease.sh

Then we estimated group-level portability of the disease traits:

  1. 11h_group_level_pred_disease.sh (creates Fig. 5; Supplementary Fig. 66)

Distribution of important variables in the dataset

We first plotted the distribution of the Townsend Deprivation Index, household income, sex, age, and country as a function of genetic distance:

  1. 12_townsend_income_sex_age_country.sh (creates Supplementary Figs. 25-29)

Then we plotted the correlation between Fst and PCs 1, 2, 3, and 40:

  1. 13_pcs_vs_fst.sh (creates Supplementary Figs. 30-33)

Sensitivity analyses for major portability trends

We performed several sensitivity analyses to test the robustness of the major portability trends for the 15 quantitative traits.

Adding assessment center and genotype array as covariates to GWAS:

  1. 14a_gwas_array_center.sh
  2. 14c_clumping_center_array.sh
  3. 14d_after_clumping_center_array.sh
  4. 14e_compute_pgs_array_center.sh

Running GWAS with regenie:

  1. 14f_gwas_regenie.sh
  2. 14g_clumping_regenie.sh
  3. 14j_after_clumping_regenie.sh
  4. 14k_compute_pgs_regenie.sh

Estimating PGS with PRS-CS:

  1. 14l_compute_pgs_prscs.sh

Re-running GWAS with 300K White British (WB) individuals:

  1. 14n_gwas_300K.sh
  2. 14p_clumping_300K.sh
  3. 14q_after_clumping_300K.sh
  4. 14r_compute_pgs_300K.sh
  5. 14s_pc_dist_fst_300K.sh
  6. 14w_find_best_num_pc_300K.sh
  7. 14z_pc_dist_fst_plots_300K.sh (creates Supplementary Fig. 71)

Then we evaluated PGS prediction accuracy at both the group level and individual level and compared the results with the original plots (Fig. 2; Supplementary Figs. 2-13):

  1. 15c_group_ind_level_pred_sensitivity.sh (creates Supplementary Figs. 34-65, 67-70, 72-79)
