Three Open Questions in Polygenic Score Portability

Provided below are instructions and details for scripts used to generate the results and figures in "Three Open Questions in Polygenic Score Portability".

For questions regarding the scripts, please contact Joyce Wang at [email protected].

Installation

Install the software:

Download the UK Biobank (UKB) dataset, following their guidelines. The scripts also use the 1000 Genomes phase 3 dataset provided by PLINK. You do not need to download it beforehand, because 05h_ukb_kgp_pca.sh downloads it.

For running the scripts, we recommend creating a conda environment.

git clone https://github.com/harpak-lab/Portability_Questions
cd Portability_Questions
conda env create -f environment.yml

Modification

Copy any scripts you need from their folder to the root directory of the cloned repo. Scripts can depend on each other, so we recommend copying all the scripts in a folder together. For example, to copy all the scripts under Prepare_the_data, use:

cd Prepare_the_data
cp * ../
cd ../

Before execution, update the directory paths in the scripts so that they point to your own directories. The line #SBATCH -A OTH21148 also needs to be updated to match your allocation information.
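As a minimal sketch of the edits described above, the following uses sed to rewrite the #SBATCH -A line and swap one directory prefix across a set of scripts. NEW_ACCOUNT, OLD_DIR, and NEW_DIR are hypothetical placeholders, and the example runs on a throwaway file in a temporary directory rather than on the repository scripts. The actual paths differ from script to script, so check each file by hand afterwards.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values: replace with your own allocation and directories.
NEW_ACCOUNT="MY_ALLOCATION"
OLD_DIR="/path/in/original/scripts"
NEW_DIR="/path/to/your/data"

# Work on a throwaway example so nothing in the repo is modified.
workdir="$(mktemp -d)"
cat > "$workdir/example.sh" <<'EOF'
#!/bin/bash
#SBATCH -A OTH21148
cd /path/in/original/scripts/data
EOF

# Rewrite the allocation line and the directory prefix in every .sh file.
# -i.bak keeps a backup copy and works with both GNU and BSD sed.
for f in "$workdir"/*.sh; do
  sed -i.bak \
    -e "s|^#SBATCH -A .*|#SBATCH -A ${NEW_ACCOUNT}|" \
    -e "s|${OLD_DIR}|${NEW_DIR}|g" \
    "$f"
done

cat "$workdir/example.sh"
```

To apply the same idea to the real scripts, point the loop at the repository root instead of the temporary directory and keep the .bak files until you have confirmed the changes.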

Execution

Submit the bash scripts ending with .sh with sbatch <script_name.sh>, or refer to the documentation of your computing clusters on how to submit a job.

Please see the details for each script in the following sections:

Preparing the data

Execute the following files to filter and prepare the data:

00_make_directories.sh
01a_extract_data_fields.sh (make sure to edit the file so it's pointing to the correct UKB basket file)
01b_filter_individuals_job.sh
01d_filter_genotype_files.sh
02_prepare_covariates_phenotypes.sh
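One possible way to run the list above in order on a Slurm cluster is to chain the submissions with job dependencies. This is only a sketch: it assumes each script must finish successfully before the next starts, which this README does not state, and it will not wait for any jobs that a script submits on its own. It runs in dry-run mode by default, printing the sbatch commands instead of submitting them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Data preparation scripts, in the order listed in this README.
scripts=(
  00_make_directories.sh
  01a_extract_data_fields.sh
  01b_filter_individuals_job.sh
  01d_filter_genotype_files.sh
  02_prepare_covariates_phenotypes.sh
)

# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to submit.
DRY_RUN="${DRY_RUN:-1}"

prev_job=""
commands=()
for s in "${scripts[@]}"; do
  cmd=(sbatch --parsable)
  if [ -n "$prev_job" ]; then
    # Start this job only after the previous one exits with status 0.
    cmd+=(--dependency="afterok:${prev_job}")
  fi
  cmd+=("$s")
  commands+=("${cmd[*]}")

  if [ "$DRY_RUN" = "1" ]; then
    echo "${cmd[*]}"
    prev_job="<jobid:${s}>"          # placeholder shown in the dry run
  else
    prev_job="$("${cmd[@]}")"        # --parsable prints the job ID
    prev_job="${prev_job%%;*}"       # drop ";cluster" suffix if present
    echo "submitted ${s} as job ${prev_job}"
  fi
done
```

Run it with DRY_RUN=0 only after the paths and the allocation line in each script have been updated.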

GWAS

In the selection of the GWAS sample, we used the White British classification as provided by the UKB.

Execute the following files to perform GWAS, clumping, and thresholding:

03_gwas.sh
04a_clumping.sh
04e_after_clumping.sh

Genetic distance calculations

The fixation index (F_st) is a natural single-number metric of divergence between two sets of chromosomes, and we considered using it to measure the distance between an individual's pair of chromosomes and the chromosomes in the GWAS sample. However, calculating F_st was computationally costly, so we used Euclidean distance in PC space as a single-number proxy for genetic distance from the GWAS sample.

Execute the following files to calculate F_st:

05a_pc_dist_fst.sh
All the scripts created under temp_fst_path

Then, execute the following files to calculate Euclidean distance:

05e_find_best_num_pc.sh (creates Supplementary Fig. 1)
05h_ukb_kgp_pca.sh (downloads 1000 Genomes phase 3 dataset provided by Plink)
05j_pc_dist_fst_plots.sh (creates Fig. 1)
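To illustrate the idea of Euclidean distance in PC space, the sketch below computes each individual's distance from the centroid of the GWAS sample with awk. The input layout (IID, a GWAS-sample flag, then PC coordinates), the use of the centroid as the reference point, and NUM_PCS=2 are all assumptions made for this toy example. The repository scripts define the actual computation and the number of PCs.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Toy input (hypothetical layout): IID, GWAS-sample flag, PC coordinates.
cat > "$workdir/pcs.txt" <<'EOF'
IID in_gwas PC1 PC2
A 1 0 0
B 1 2 0
C 0 4 4
EOF

NUM_PCS=2   # assumed here; not the value used in the paper

# The file is read twice: pass 1 accumulates the GWAS-sample centroid,
# pass 2 prints each individual's Euclidean distance to that centroid.
awk -v k="$NUM_PCS" '
  FNR == 1 { next }                      # skip the header in both passes
  NR == FNR {                            # pass 1
    if ($2 == 1) { n++; for (j = 1; j <= k; j++) sum[j] += $(2 + j) }
    next
  }
  {                                      # pass 2
    d2 = 0
    for (j = 1; j <= k; j++) {
      diff = $(2 + j) - sum[j] / n
      d2 += diff * diff
    }
    printf "%s %.4f\n", $1, sqrt(d2)
  }
' "$workdir/pcs.txt" "$workdir/pcs.txt" > "$workdir/pc_dist.txt"

cat "$workdir/pc_dist.txt"
```

In this toy data the GWAS centroid is (1, 0), so individuals A and B are at distance 1 and individual C is at distance 5.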

PGS and evaluating PGS prediction accuracy

Execute the following file to calculate PGS:

06_compute_prs.sh
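For readers unfamiliar with how a PGS is formed, the following sketch computes a score as the sum of effect size times allele dosage over SNPs. It illustrates the general definition only and is not the method used in 06_compute_prs.sh. The file names, column layouts, and values are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Hypothetical per-SNP effect sizes.
cat > "$workdir/weights.txt" <<'EOF'
SNP BETA
rs1 0.5
rs2 -0.2
EOF

# Hypothetical long-format dosages: one row per individual and SNP.
cat > "$workdir/dosages.txt" <<'EOF'
IID SNP DOSAGE
A rs1 2
A rs2 1
B rs1 0
B rs2 2
EOF

# Load the weights from the first file, then accumulate
# beta * dosage per individual from the second file.
awk '
  FNR == 1 { next }                         # skip headers
  NR == FNR { beta[$1] = $2; next }         # first file: weights
  ($2 in beta) { score[$1] += beta[$2] * $3 }
  END { for (id in score) printf "%s %.4f\n", id, score[id] }
' "$workdir/weights.txt" "$workdir/dosages.txt" | sort > "$workdir/pgs.txt"

cat "$workdir/pgs.txt"
```

Here individual A scores 0.5 * 2 + (-0.2) * 1 = 0.8 and individual B scores (-0.2) * 2 = -0.4.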

We evaluated PGS prediction accuracy at both the group level and individual level:

07_group_ind_level_pred.sh (creates Fig. 2; Supplementary Figs. 2-13)

We compared the variance in squared prediction error explained by 8 raw measures: genetic distance; Townsend Deprivation Index; average yearly total household income before tax; educational attainment, which we converted into years of education; minor allele counts for SNPs in each of three equally sized bins of small, medium, and large squared effect sizes (see Fig. S23); and minor allele counts of all SNPs:

08a_prepare_for_ma_counts.sh
08b_calc_ma_counts.sh
08d_ind_pred_plots.sh (creates Fig. 3; Supplementary Figs. 14-21)

Additional analyses on lymphocyte count

To understand why immunity-related traits like lymphocyte count have group-level prediction accuracy that drops near zero even at a short genetic distance, we performed additional analyses.

We first performed two additional GWASs and compared the allelic effects across the three GWASs:

09a_prepare_close_far_pca.sh
09c_close_far_pca.sh
09d_prepare_close_far_gwas.sh
09f_gwas_close.sh and 09f_gwas_far.sh

We calculated heterozygosity at index SNPs as a function of genetic distance:

09g_calc_heterozygosity.sh
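As a small illustration of heterozygosity at a single SNP across genetic distance bins, the sketch below reports the allele frequency p, the expected heterozygosity 2p(1 - p), and the observed fraction of heterozygotes in each bin. This README does not say which definition 09g_calc_heterozygosity.sh uses, so both are shown. The binning and the input layout are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Toy input (hypothetical): distance bin and genotype (0, 1, or 2 copies
# of the counted allele) for one SNP, one row per individual.
cat > "$workdir/geno.txt" <<'EOF'
bin genotype
1 0
1 1
1 1
1 2
2 0
2 0
2 0
2 2
EOF

# Per bin: allele frequency p, expected heterozygosity 2p(1-p),
# and observed heterozygosity (fraction of genotype == 1).
awk '
  NR == 1 { next }
  { n[$1]++; alt[$1] += $2; if ($2 == 1) het[$1]++ }
  END {
    for (b in n) {
      p = alt[b] / (2 * n[b])
      printf "%s %.4f %.4f %.4f\n", b, p, 2 * p * (1 - p), het[b] / n[b]
    }
  }
' "$workdir/geno.txt" | sort -n > "$workdir/het.txt"

cat "$workdir/het.txt"
```

In the toy data, bin 1 has p = 0.5 with expected and observed heterozygosity both 0.5, while bin 2 has p = 0.25 with expected heterozygosity 0.375 and no observed heterozygotes.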

We examined the variance of PGS as a function of genetic distance:

09j_compare_effect_sizes_heterozygosity_var_pgs_plots.sh (creates Fig. 4; Supplementary Fig. 22)

We estimated the heritability associated with each index SNP:

10a_compare_heritability.sh (creates Supplementary Figs. 23-24)

Portability of disease traits

We estimated the PGS portability of 3 disease traits at the group level.

For these disease traits, we ran GWAS, clumped and thresholded the SNPs, and calculated PGS:

11a_gwas_disease.sh
11c_clumping_disease.sh
11f_after_clumping_disease.sh
11g_compute_pgs_disease.sh

Then we estimated group level portability of the disease traits:

11h_group_level_pred_disease.sh (creates Fig. 5; Supplementary Fig. 66)

Distribution of important variables in the dataset

We first plotted the distribution of Townsend deprivation index, household income, sex, age, and country as a function of genetic distance:

12_townsend_income_sex_age_country.sh (creates Supplementary Figs. 25-29)

Then we plotted the correlation between F_st and PCs 1, 2, 3, and 40:

13_pcs_vs_fst.sh (creates Supplementary Figs. 30-33)

Sensitivity analyses for major portability trends

We performed a few sensitivity analyses to test the major portability trends of the 15 quantitative traits.

Adding assessment center and genotype array as covariates to GWAS:

14a_gwas_array_center.sh
14c_clumping_center_array.sh
14d_after_clumping_center_array.sh
14e_compute_pgs_array_center.sh

Running GWAS with regenie:

14f_gwas_regenie.sh
14g_clumping_regenie.sh
14j_after_clumping_regenie.sh
14k_compute_pgs_regenie.sh

Estimating PGS with PRS-CS:

14l_compute_pgs_prscs.sh

Re-running GWAS with 300K White British (WB) individuals:

14n_gwas_300K.sh
14p_clumping_300K.sh
14q_after_clumping_300K.sh
14r_compute_pgs_300K.sh
14s_pc_dist_fst_300K.sh
14w_find_best_num_pc_300K.sh
14z_pc_dist_fst_plots_300K.sh (creates Supplementary Fig. 71)

Then we evaluated PGS prediction accuracy at both the group level and individual level and compared the results with the original plots (Fig. 2; Supplementary Figs. 2-13):

15c_group_ind_level_pred_sensitivity.sh (creates Supplementary Figs. 34-65, 67-70, 72-79)
