Provided below are instructions and details for scripts used to generate the results and figures in "Three Open Questions in Polygenic Score Portability".
For questions regarding the scripts, please contact Joyce Wang at [email protected].
Install the software:
- Plink 1.9 (Purcell, S. & Chang, C)
- Plink 2.0 (Purcell, S. & Chang, C)
- regenie (Mbatchou et al.)
- PRS-CS (Ge et al.)
Download the UK Biobank (UKB) dataset, following their guidelines. The scripts also use the 1000 Genomes phase 3 dataset provided by Plink, but it is not necessary to download it beforehand, as 05h_ukb_kgp_pca.sh contains scripts for downloading it.
For running the scripts, we recommend creating a conda environment.
git clone https://github.com/harpak-lab/Portability_Questions
cd Portability_Questions
conda env create -f environment.yml
Copy any scripts you need from their folder to the root directory of the cloned repo. Please note that scripts can depend on each other, so we recommend copying all the scripts in a folder together. For example, to copy all the scripts under Prepare_the_data, use:
cd Prepare_the_data
cp * ../
cd ../
Before running the scripts, edit the directory paths in them so that they point to your own directories. The #SBATCH -A OTH21148 line also needs to be updated with your allocation information.
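As a minimal sketch of the allocation update, the helper below rewrites the #SBATCH -A line in a set of scripts. The function name set_allocation and the allocation name MY_ALLOCATION are hypothetical, and the sketch assumes the allocation name contains no characters that are special to sed (such as / or &). Directory paths still need to be edited by hand.

```shell
# Hypothetical helper: replace the "#SBATCH -A OTH21148" line in each script.
# $1 is your allocation name; the remaining arguments are the scripts to edit.
set_allocation() {
  local new_alloc=$1
  shift
  local f
  for f in "$@"; do
    # -i.bak edits in place with both GNU and BSD sed; remove the backup on success.
    sed -i.bak "s/^#SBATCH -A OTH21148/#SBATCH -A ${new_alloc}/" "$f" && rm -f "$f.bak"
  done
}

# Example (run from the repo root after copying the scripts):
#   set_allocation MY_ALLOCATION *.sh
```

Running grep -H "^#SBATCH -A" *.sh afterwards is a quick way to check that every script now carries your allocation.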
Submit the bash scripts ending with .sh with sbatch <script_name.sh>, or refer to the documentation of your computing clusters on how to submit a job.
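If your cluster uses Slurm, the following sketch submits several scripts so that each one starts only after the previous job has finished successfully. The function name submit_chain is hypothetical, and the sketch assumes the scripts should run in the order given. Scripts that themselves submit or generate further jobs are not tracked by this approach, so check those by hand.

```shell
# Hypothetical helper: submit scripts one after another with Slurm dependencies.
submit_chain() {
  local prev_id="" job_id script
  for script in "$@"; do
    if [ -z "$prev_id" ]; then
      job_id=$(sbatch --parsable "$script") || return 1
    else
      # afterok: start only if the previous job completed with exit code 0.
      job_id=$(sbatch --parsable --dependency=afterok:"$prev_id" "$script") || return 1
    fi
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    job_id=${job_id%%;*}
    echo "submitted $script as job $job_id"
    prev_id=$job_id
  done
}

# Example:
#   submit_chain 00_make_directories.sh 01a_extract_data_fields.sh
```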
Please see the details for each script in the following sections:
Execute the following files to filter and prepare the data:
- 00_make_directories.sh
- 01a_extract_data_fields.sh (make sure to edit the file so that it points to the correct UKB basket file)
- 01b_filter_individuals_job.sh
- 01d_filter_genotype_files.sh
- 02_prepare_covariates_phenotypes.sh
In the selection of the GWAS sample, we used the White British classification as provided by the UKB.
Execute the following files to perform GWAS, clumping, and thresholding:
- 03_gwas.sh
- 04a_clumping.sh
- 04e_after_clumping.sh
The fixation index (Fst) is a natural single-number metric for the divergence between two sets of chromosomes, and we considered using it to measure the distance between an individual's pair of chromosomes and the chromosomes in the GWAS sample. However, calculating Fst was computationally costly, so we used Euclidean distance in PC space as a single-number proxy for genetic distance from the GWAS sample.
Execute the following files to calculate Fst:
- 05a_pc_dist_fst.sh
- All the scripts created under temp_fst_path
Then, execute the following files to calculate Euclidean distance:
- 05e_find_best_num_pc.sh (creates Supplementary Fig. 1)
- 05h_ukb_kgp_pca.sh (downloads 1000 Genomes phase 3 dataset provided by Plink)
- 05j_pc_dist_fst_plots.sh (creates Fig. 1)
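To illustrate what a Euclidean distance in PC space looks like, here is a minimal sketch that measures each individual's distance from the centroid of the GWAS sample. This is not the repository's code. The function name pc_distance, the input layout (a whitespace-delimited PC table with a header line, the ID in column 1, and PCs in the remaining columns, plus a headerless file of GWAS-sample IDs), and the use of the unweighted centroid as the reference point are all assumptions made for illustration.

```shell
# Hypothetical helper: Euclidean distance of every individual from the
# centroid of the GWAS sample, using all PC columns in the table.
# $1: file of GWAS-sample IDs, one per line, no header (assumed layout)
# $2: PC table with a header line, ID in column 1, PCs in columns 2..NF
pc_distance() {
  awk '
    FNR == 1 { pass++ }                      # a new input file starts
    pass == 1 { gwas[$1] = 1; next }         # pass 1: load GWAS-sample IDs
    pass == 2 {                              # pass 2: sum PCs over the GWAS sample
      if (FNR > 1 && ($1 in gwas)) {
        n++
        for (k = 2; k <= NF; k++) sum[k] += $k
      }
      next
    }
    pass == 3 {                              # pass 3: distance to the centroid
      if (FNR == 1) { print "IID", "pc_dist"; next }
      if (n == 0) { print "no GWAS-sample IDs found" > "/dev/stderr"; exit 1 }
      d = 0
      for (k = 2; k <= NF; k++) { diff = $k - sum[k] / n; d += diff * diff }
      print $1, sqrt(d)
    }
  ' "$1" "$2" "$2"
}

# Example (file names are placeholders):
#   pc_distance gwas_ids.txt pcs.txt > pc_dist.txt
```

The PC table is read twice so that the centroid is known before any distance is computed. To use only a subset of PCs, trim the table to the desired columns first.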
Execute the following file to calculate PGS:
06_compute_prs.sh
We evaluated PGS prediction accuracy at both the group level and individual level:
07_group_ind_level_pred.sh (creates Fig. 2; Supplementary Figs. 2-13)
We compared the variance in squared prediction error explained by 8 raw measures: genetic distance, Townsend Deprivation Index, average yearly total household income before tax, educational attainment (converted to years of education), minor allele counts for SNPs with different magnitudes of effects (three equally sized bins of small, medium, and large squared effect sizes; see Fig. S23), and minor allele counts of all SNPs:
- 08a_prepare_for_ma_counts.sh
- 08b_calc_ma_counts.sh
- 08d_ind_pred_plots.sh (creates Fig. 3; Supplementary Figs. 14-21)
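The three equally sized effect-size bins can be pictured with the following sketch, which ranks SNPs by squared effect estimate and splits the ranking into thirds. It is an illustration, not the repository's code. The function name bin_by_effect and the input layout (a header line, SNP ID in column 1, effect estimate in column 2) are hypothetical.

```shell
# Hypothetical helper: assign SNPs to small / medium / large bins of
# squared effect size, with (near-)equal numbers of SNPs per bin.
# $1: table with a header line, SNP ID in column 1, effect estimate in column 2
bin_by_effect() {
  tail -n +2 "$1" |
    awk '{ printf "%s %.12g\n", $1, $2 * $2 }' |   # squared effect size
    sort -k2,2g |                                   # ascending, handles 1e-08 style values
    awk '
      { snp[NR] = $1 }
      END {
        for (i = 1; i <= NR; i++) {
          if (i <= NR / 3) bin = "small"
          else if (i <= 2 * NR / 3) bin = "medium"
          else bin = "large"
          print snp[i], bin
        }
      }'
}

# Example (file name is a placeholder):
#   bin_by_effect index_snps.txt > snp_bins.txt
```

When the number of SNPs is not divisible by three, the bins differ in size by at most one SNP.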
To understand why immunity-related traits like lymphocyte count have group-level prediction accuracy that drops near zero even at a short genetic distance, we performed additional analyses.
We first performed two additional GWASs and compared the allelic effects across the three GWASs:
- 09a_prepare_close_far_pca.sh
- 09c_close_far_pca.sh
- 09d_prepare_close_far_gwas.sh
- 09f_gwas_close.sh and 09f_gwas_far.sh
We calculated heterozygosity at index SNPs as a function of genetic distance:
09g_calc_heterozygosity.sh
We examined the variance of PGS as a function of genetic distance:
09j_compare_effect_sizes_heterozygosity_var_pgs_plots.sh (creates Fig. 4; Supplementary Fig. 22)
We estimated the heritability associated with each index SNP:
10a_compare_heritability.sh (creates Supplementary Figs. 23-24)
We estimated the PGS portability of 3 disease traits at the group level.
For these disease traits, we ran GWAS, clumped and thresholded the SNPs, and calculated PGS:
- 11a_gwas_disease.sh
- 11c_clumping_disease.sh
- 11f_after_clumping_disease.sh
- 11g_compute_pgs_disease.sh
Then we estimated the group-level portability of the disease traits:
11h_group_level_pred_disease.sh (creates Fig. 5; Supplementary Fig. 66)
We first plotted the distribution of Townsend deprivation index, household income, sex, age, and country as a function of genetic distance:
12_townsend_income_sex_age_country.sh (creates Supplementary Figs. 25-29)
Then we plotted the correlation between Fst and PCs 1, 2, 3, and 40:
13_pcs_vs_fst.sh (creates Supplementary Figs. 30-33)
We performed a few sensitivity analyses to test the major portability trends of the 15 quantitative traits.
- 14a_gwas_array_center.sh
- 14c_clumping_center_array.sh
- 14d_after_clumping_center_array.sh
- 14e_compute_pgs_array_center.sh
Running GWAS with regenie:
- 14f_gwas_regenie.sh
- 14g_clumping_regenie.sh
- 14j_after_clumping_regenie.sh
- 14k_compute_pgs_regenie.sh
Estimating PGS with PRS-CS:
14l_compute_pgs_prscs.sh
- 14n_gwas_300K.sh
- 14p_clumping_300K.sh
- 14q_after_clumping_300K.sh
- 14r_compute_pgs_300K.sh
- 14s_pc_dist_fst_300K.sh
- 14w_find_best_num_pc_300K.sh
- 14z_pc_dist_fst_plots_300K.sh (creates Supplementary Fig. 71)
Then we evaluated PGS prediction accuracy at both the group level and individual level and compared the results with the original plots (Fig. 2; Supplementary Figs. 2-13):
15c_group_ind_level_pred_sensitivity.sh (creates Supplementary Figs. 34-65, 67-70, 72-79)