Abstract
Polygenic risk scores (PRS) have been shown to help improve predictive models [66][65][34], build
instrumental variables [76][77], study disease etiology [60], and contribute to risk assessment when
combined with other diagnostic tools [38]. Thus their importance to research has increased over
time. However, PRS have limited transferability across ancestral populations due to differences in
linkage disequilibrium (LD), allele frequencies, and environmental exposure. In addition, most
of the GWAS samples are of European ancestry, and diversity has not increased in recent years.
As a result, premature inclusion of PRS in clinical practice could increase health disparities across ancestral groups. Currently, several projects are intentionally
collecting samples from non-European populations [51][31][82][1]. However, increasing samples from underrepresented groups must be done through community inclusion [28] and with protection
against commodification [26]. Moreover, even with these efforts, the overrepresentation of European samples had not changed significantly as of 2021 [23]. Thus, there is a
need for methods designed explicitly to improve prediction and estimation in target populations, empowering researchers from these communities to leverage the existing data efficiently.
Chapter 2 of this work quantifies the effect of different LD structures on a PRS constructed
using European genome-wide significant variants. We estimate how the predictive ability of a
European-derived tag variant changes across populations using extensive simulations with the 1000
Genomes Project haplotypes. To isolate the effect of LD, we assume a genetic model with the same
effect size of the true underlying risk variant across the five populations of the 1000 Genomes Project. Under
this scenario, if the most significant variant is not the risk variant, then its predictive ability depends
on the LD between the index and true causal variants. In our simulations across the genome, we
found that even under an optimistic scenario, the index variant was not the risk variant around 60%
of the time. If, most of the time, we are not finding the causal variant, then how much predictive
ability do we expect to lose in different populations? Chapter 2 estimates that the reduction in the
predictive ability of the most significant variant is modest in Admixed American, South Asian, and
East Asian ancestral populations. However, in African populations, the loss of predictive ability
can be substantial, reaching up to a 50% reduction 22% of the time. Finally, in this chapter, we
present evidence suggesting that the LD score can be informative about the probability of tagging the
true risk variant in a region.
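
As a minimal illustration of why LD drives this loss, consider a single causal variant with the same effect size in every population. Under that assumption, the trait variance explained by a tag variant equals the variance explained by the causal variant multiplied by the squared correlation (r^2) between the two variants. The Python sketch below computes this quantity from 0/1 haplotype vectors. The function names, the input format, and the unit noise variance are illustrative assumptions and not the simulation pipeline of Chapter 2.

```python
import numpy as np

def ld_r2(tag, causal):
    """Squared Pearson correlation (LD r^2) between two 0/1 haplotype vectors."""
    tag = np.asarray(tag, dtype=float)
    causal = np.asarray(causal, dtype=float)
    return float(np.corrcoef(tag, causal)[0, 1] ** 2)

def tag_variance_explained(tag, causal, beta, noise_var=1.0):
    """Trait variance explained by the tag when only `causal` affects the trait.

    Assumed model (hypothetical): y = beta * causal + e, Var(e) = noise_var.
    Then corr(y, tag)^2 = r^2 * beta^2 * Var(causal) / Var(y).
    """
    causal = np.asarray(causal, dtype=float)
    genetic_var = beta ** 2 * causal.var()
    return ld_r2(tag, causal) * genetic_var / (genetic_var + noise_var)

# Toy haplotypes: the tag is in perfect LD with the causal variant in one
# population and in weaker LD in another (made-up data for illustration).
pop_a_tag = [0, 0, 1, 1, 0, 1, 1, 0]
pop_a_causal = [0, 0, 1, 1, 0, 1, 1, 0]
pop_b_tag = [0, 0, 1, 1, 0, 1, 1, 0]
pop_b_causal = [0, 1, 1, 1, 0, 0, 1, 0]

retained = (tag_variance_explained(pop_b_tag, pop_b_causal, beta=0.5)
            / tag_variance_explained(pop_a_tag, pop_a_causal, beta=0.5))
print(f"fraction of predictive ability retained: {retained:.2f}")
```

Because the causal allele frequency is the same in both toy populations, the retained fraction here reduces to the ratio of the two r^2 values; with differing allele frequencies the causal variance term would also change.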
We are interested in improving prediction measures such as mean square error (MSE) or area
under the curve (AUC) by leveraging an external population with precise but biased information.
Generally, when an external population is treated as a valuable source of information, it is assumed that any inference relying on that population incurs bias in exchange for a reduction in
variance. In the context of PRS transferability, Chapter 2 showed that differences in LD
can cause significant bias in PRS prediction. Nevertheless, it also showed that this bias
could be slight even in genetically distant populations. Chapter 3 proposes a method that dynamically adapts to LD and effect size differences across populations to increase the predictive ability
in one of them, called the target population. The method works with GWAS summary statistics
from two populations and returns an estimate of the multivariate effect size for a region for a target
population. GWAS summary statistics are effect size estimates of the univariate regressions and
thus are not comparable across populations with different LD structures. We use Regression
with Summary Statistics (RSS) to infer the multivariate effect size in each population from the joint
likelihood of the marginal summary statistics. Nevertheless, the multivariate regression in a genomic region is still a misspecified model because it is impossible to include all possible
interactions. When this is the case, the multivariate effect size might differ between populations
due to differences in allele frequencies and environmental exposures. To handle this
scenario, we use a power prior to account for heterogeneity across populations. We simulated
GWAS data from European and African populations and showed that our method improves prediction on several measures when the genetic correlation between populations is positive. This method
shows promise for increasing the predictive ability of PRS in populations where the sample
size of the existing GWAS is limited.
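
To make the idea of combining the two populations concrete, the following Python sketch shows a deliberately simplified version of a power prior applied to summary statistics. It assumes standardized genotypes and traits, so that the marginal estimates in each population approximately follow N(R b, R / n), where R is the LD matrix, b the joint effect vector, and n the sample size. It also assumes a flat initial prior and a fixed discount a0 in [0, 1] on the external likelihood. These simplifications, the fixed a0, and the function name are assumptions made for illustration; the method of Chapter 3 adapts to the data and is not reproduced here.

```python
import numpy as np

def power_prior_joint_effects(bhat_t, R_t, n_t, bhat_e, R_e, n_e, a0):
    """Posterior mode of the joint effects under a fixed-a0 power prior.

    Illustrative assumptions: in each population bhat ~ N(R b, R / n), with a
    flat initial prior on b. Raising the external likelihood to the power a0
    and setting the gradient of the log posterior to zero gives
        (n_t R_t + a0 n_e R_e) b = n_t bhat_t + a0 n_e bhat_e.
    """
    bhat_t = np.asarray(bhat_t, dtype=float)
    bhat_e = np.asarray(bhat_e, dtype=float)
    R_t = np.asarray(R_t, dtype=float)
    R_e = np.asarray(R_e, dtype=float)
    precision = n_t * R_t + a0 * n_e * R_e
    score = n_t * bhat_t + a0 * n_e * bhat_e
    return np.linalg.solve(precision, score)

# Toy two-variant region (made-up numbers): same joint effect, different LD.
b_true = np.array([0.2, 0.0])
R_target = np.array([[1.0, 0.3], [0.3, 1.0]])
R_external = np.array([[1.0, 0.8], [0.8, 1.0]])
bhat_target = R_target @ b_true      # noiseless marginal effects, target
bhat_external = R_external @ b_true  # noiseless marginal effects, external

for a0 in (0.0, 0.5, 1.0):
    est = power_prior_joint_effects(bhat_target, R_target, 2_000,
                                    bhat_external, R_external, 50_000, a0)
    print(a0, est)
```

With a0 = 0 the external population is ignored and the estimate reduces to solving R_t b = bhat_t; with a0 = 1 the two likelihoods are pooled. Intermediate values discount the external data, which is the mechanism that allows heterogeneity between populations to be tolerated.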
In non-genetic settings, there is extensive literature on leveraging existing studies to improve
estimation or prediction in one study. Chapter 4 extends Data Enriched Linear Regression (DELR) [15]
to generalized linear regression link functions, which we call DEGLR. We show that the objective function of DELR is
equivalent to the objective function of penalized regression, which means we can use existing
software to obtain estimates. However, penalized regression does not differentiate between target
and external data sources, and thus it requires a different algorithm to find the best penalty. We
develop a cross-validation algorithm to find the penalty factor that optimizes prediction in
the target population. Furthermore, we show through simulations that DEGLR improves prediction
when the bias is small and converges to ignoring the external study as the bias increases. In a real data
analysis, the Health and Retirement Study (HRS) is our target data source, and the Genes for Good (GfG) study
is our external study. We use these data sources to explore whether the predictive
ability of PRS as a covariate can be increased when the proportion of White participants is much higher in the
external data source than in the target. We systematically split the HRS data into small training
sets and increased the sample size gradually. From this analysis, we see that as the HRS’s training
sample size increases, DEGLR adapts the weight of GfG to optimize the predictive ability. When
the ratio of the synthetic GfG is large, DEGLR uses the GfG data and increases the predictive
ability in HRS. As this ratio decreases, DEGLR uses less GfG data and matches the
predictive ability of HRS alone.
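
The equivalence between the data-enriched objective and a penalized regression can be sketched for the linear case as follows. The target data are modeled with coefficients beta and the external data with beta + gamma, where gamma is a penalized shift. Stacking the two data sets gives an ordinary penalized regression in which only the gamma columns are penalized. The Python sketch below assumes a ridge penalty on gamma and squared-error loss, and it selects the penalty by cross-validation over target observations only. The ridge penalty, the function names, and the grid of penalties are assumptions made for illustration; the generalized linear extension and the algorithm of Chapter 4 are not reproduced here.

```python
import numpy as np

def fit_data_enriched(X_t, y_t, X_e, y_e, lam):
    """Minimize ||y_t - X_t b||^2 + ||y_e - X_e (b + g)||^2 + lam * ||g||^2."""
    p = X_t.shape[1]
    # Stacked design: target rows see only b, external rows see b and g.
    X = np.block([[X_t, np.zeros_like(X_t)], [X_e, X_e]])
    y = np.concatenate([y_t, y_e])
    D = np.zeros((2 * p, 2 * p))
    D[p:, p:] = np.eye(p)  # penalize the shift g only
    coef = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    return coef[:p], coef[p:]

def cv_lambda(X_t, y_t, X_e, y_e, lams, k=5, seed=0):
    """Pick lam by k-fold CV in which only target rows are ever held out."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_t)), k)
    errs = []
    for lam in lams:
        sse = 0.0
        for hold in folds:
            train = np.setdiff1d(np.arange(len(y_t)), hold)
            beta, _ = fit_data_enriched(X_t[train], y_t[train], X_e, y_e, lam)
            sse += np.sum((y_t[hold] - X_t[hold] @ beta) ** 2)
        errs.append(sse / len(y_t))  # mean squared prediction error on target
    return lams[int(np.argmin(errs))], errs

# Simulated example (made-up data): small target study, large external study
# whose coefficients are slightly shifted relative to the target.
rng = np.random.default_rng(1)
X_t = rng.normal(size=(40, 3))
y_t = X_t @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=40)
X_e = rng.normal(size=(200, 3))
y_e = X_e @ np.array([1.2, -1.0, 0.5]) + rng.normal(size=200)

best_lam, cv_errors = cv_lambda(X_t, y_t, X_e, y_e, lams=[0.0, 1.0, 10.0, 100.0])
beta_hat, gamma_hat = fit_data_enriched(X_t, y_t, X_e, y_e, best_lam)
print(best_lam, beta_hat, gamma_hat)
```

With lam = 0 the shift is unpenalized and beta equals the target-only least-squares fit, so the external study is effectively ignored. As lam grows, gamma is forced toward zero and the fit approaches the pooled regression. Holding out only target observations is what distinguishes this search from standard penalized-regression cross-validation, which would treat both sources alike.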
