Transferability of Polygenic Risk Scores Across Ancestral Populations and Data Integration Methods for Improving Prediction on Small Sample Studies

Year of Publication	2023
Author	Pedro del Pino
Institution	University of Michigan
City	Ann Arbor, Michigan
Abstract	Polygenic risk scores (PRS) have been proven to help improve predictive models [66][65][34], build instrumental variables [76][77], study disease etiology [60], and contribute to risk assessment when combined with other diagnostic tools [38]. Thus, their importance to research has increased over time. However, PRS have limited transferability across ancestral populations due to differences in linkage disequilibrium (LD), allele frequencies, and environmental exposure. In addition, most GWAS samples are of European ancestry, and diversity has not increased in recent years. As a result, a premature inclusion of PRS into clinical practice could increase health disparities across ancestral groups. Currently, several projects are deliberately collecting samples from non-European populations [51][31][82][1]. However, increasing samples from under-represented groups must be done through community inclusion [28] and with protection against commodification [26]. Moreover, despite these efforts, the overrepresentation of European samples had not changed significantly as of 2021 [23]. Thus, there is a need for methods designed explicitly to improve prediction and estimation in target populations, empowering researchers from these communities to leverage the existing data efficiently. Chapter 2 of this work quantifies the effect of differing LD structures on a PRS constructed using European genome-wide significant variants. Using extensive simulations with the 1000 Genomes Project haplotypes, we estimate how the predictive ability of a European-derived tag variant changes across populations. To isolate the effect of LD, we assume a genetic model in which the true underlying risk variant has the same effect size across the five populations of the 1000 Genomes Project. Under this scenario, if the most significant variant is not the risk variant, then its predictive ability depends on the LD between the index and true causal variants.
In our simulations across the genome, we found that even under an optimistic scenario, the index variant was not the risk variant around 60% of the time. If we usually do not find the causal variant, how much predictive ability do we expect to lose in different populations? Chapter 2 estimates that the reduction in the predictive ability of the most significant variant is modest in Admixed American, South Asian, and East Asian ancestral populations. However, in African populations, the loss of predictive ability can be substantial, reaching up to a 50% reduction 22% of the time. Finally, in this chapter, we present evidence suggesting that the LD score can be informative about the probability of tagging the true risk variant in a region. We are interested in improving prediction measures such as mean squared error (MSE) or area under the curve (AUC) by leveraging an external population with precise but biased information. Generally, when an external population is considered a valuable source of information, it is assumed that any inference relying on that population is biased, in exchange for a reduction in variance. In the context of the transferability of PRS in Chapter 2, we showed that differences in LD could cause significant bias in PRS prediction. Nevertheless, we also showed that this bias could be slight even in genetically distant populations. Chapter 3 proposes a method that dynamically adapts to LD and effect size differences across populations to increase the predictive ability in one of them, called the target population. The method works with GWAS summary statistics from two populations and returns an estimate of the multivariate effect size for a region in the target population. GWAS summary statistics are effect size estimates from univariate regressions and thus are not comparable across populations with different LD structures.
We use Regression with Summary Statistics to infer the multivariate effect size in each population from the joint likelihood of the marginal summary statistics. Nevertheless, the multivariate regression in a genomic region is still a misspecified model because it is impossible to include all the possible interactions. When this is the case, the multivariate effect size might differ between populations due to differences in allele frequencies and environmental exposures. To handle this scenario, we use a Power Prior to account for the heterogeneity across populations. We simulated GWAS data from European and African populations and showed that our method improves prediction on several measures when the genetic correlation between populations is positive. This method shows promising results in increasing the predictive ability of PRS for populations where the sample size of the existing GWAS is limited. In non-genetics scenarios, there is extensive literature on leveraging existing studies to improve estimation or prediction in one study. Chapter 4 extends Data Enriched Linear Regression (DELR) [15] to generalized linear regression link functions, an extension we refer to as DEGLR. We show that the objective function of DELR is equivalent to the objective function of penalized regression, which means we can use existing software to obtain estimates. However, penalized regression does not differentiate between target and external data sources, and thus it requires a different algorithm to find the best penalty. We develop a cross-validation algorithm to find the penalty factor that optimizes prediction in the target population. Furthermore, we show through simulations that DEGLR improves prediction when bias is small and converges to ignoring the external study as bias increases. In a real data analysis, the Health and Retirement Study (HRS) is our target data source, and the Genes for Good (GfG) study is our external study.
We use these data sources to explore whether the predictive ability of PRS as a covariate can be increased when the proportion of White participants is much higher in the external data source than in the target. We systematically split the HRS data into small training sets and gradually increased the sample size. From this analysis, we see that as the HRS training sample size increases, DEGLR adapts the weight of GfG to optimize the predictive ability. When the ratio of the synthetic GfG is large, DEGLR uses the GfG data and increases the predictive ability of HRS. As this ratio decreases, DEGLR uses less GfG data and matches the predictive ability of HRS alone.
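The Chapter 4 idea of writing a data-enriched regression as a penalized regression, with the penalty chosen by cross-validation on the target data only, can be sketched as follows. This is a minimal illustration for the linear (Gaussian) case, not the dissertation's implementation: the function names `delr_fit` and `cv_lambda`, the plain ridge penalty on the shift vector `g`, and the fold scheme are all assumptions made for the example, and the exact penalty and algorithm used in the work may differ.

```python
import numpy as np


def delr_fit(X_t, y_t, X_e, y_e, lam):
    """Hypothetical sketch of a data-enriched linear regression.

    Minimizes ||y_t - X_t b||^2 + ||y_e - X_e (b + g)||^2 + lam * ||g||^2,
    where b is the target coefficient vector and g is the shift between the
    external and target coefficients. The ridge penalty on g is an assumed
    simplification for illustration.
    """
    p = X_t.shape[1]
    # Stacked design: target rows depend on b only, external rows on b + g.
    Z = np.block([[X_t, np.zeros_like(X_t)], [X_e, X_e]])
    y = np.concatenate([y_t, y_e])
    # Penalize only the shift g, never the target coefficients b.
    D = np.diag(np.r_[np.zeros(p), np.ones(p)])
    theta = np.linalg.solve(Z.T @ Z + lam * D, Z.T @ y)
    return theta[:p], theta[p:]


def cv_lambda(X_t, y_t, X_e, y_e, lams, k=5, seed=0):
    """Choose lam by k-fold cross-validation over the target data only.

    External data always stay in the training set; the error is measured
    solely on held-out target observations.
    """
    rng = np.random.default_rng(seed)
    n_t = len(y_t)
    folds = np.array_split(rng.permutation(n_t), k)
    errs = []
    for lam in lams:
        sse = 0.0
        for hold in folds:
            train = np.setdiff1d(np.arange(n_t), hold)
            b, _ = delr_fit(X_t[train], y_t[train], X_e, y_e, lam)
            sse += np.sum((y_t[hold] - X_t[hold] @ b) ** 2)
        errs.append(sse / n_t)
    return lams[int(np.argmin(errs))], errs
```

Under these assumptions, `lam = 0` leaves `g` unconstrained, so `b` reduces to ordinary least squares on the target data alone, while a very large `lam` forces `g` toward zero and pools the two data sources. This mirrors the behavior described in the abstract, where the method leans on the external study when bias is small and ignores it as bias grows.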
URL	https://deepblue.lib.umich.edu/handle/2027.42/175640