Improving Machine Learning Prediction of ADHD Using Gene Set Polygenic Risk Scores and Risk Scores From Genetically Correlated Phenotypes.

American journal of medical genetics. Part B, Neuropsychiatric genetics : the official publication of the International Society of Psychiatric Genetics. Pub Date: 2025-07-09. DOI: 10.1002/ajmg.b.33043

Eric J Barnett, Yanli Zhang-James, Stephen V Faraone



Abstract

Polygenic risk scores (PRSs), which sum the effects of SNPs throughout the genome to measure risk afforded by common genetic variants, have improved our ability to estimate disorder risk for Attention-Deficit/Hyperactivity Disorder (ADHD), but the accuracy of risk prediction is rarely investigated. In a study of 10,887 participants across nine cohorts, we performed gene set analysis of GWAS data to select gene sets associated with ADHD within a training subset. For each selected gene set, we generated a gene set polygenic risk score (gsPRS), which sums the effects of the SNPs in that gene set. We created gsPRSs for ADHD and for phenotypes that are genetically correlated with ADHD. These gsPRSs were added to the standard PRS as input to machine learning models predicting ADHD. On the test subset, a random forest (RF) model using PRSs from ADHD and genetically correlated phenotypes and an optimized group of 20 gsPRSs had an area under the receiver operating characteristic curve (AUC) of 0.72 (95% CI: 0.70-0.74). This AUC was a statistically significant improvement over logistic regression models and RF models using only PRSs from ADHD and genetically correlated phenotypes. Summing risk at the gene set level and incorporating genetic risk from disorders with high genetic correlations with ADHD improved the accuracy of predicting ADHD. Learning curves suggest that additional improvements would be expected with larger study sizes. Our study suggests that better accounting of genetic risk and the genetic context of allelic differences results in more predictive models.
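As an illustration only, the difference between a genome-wide PRS and a gsPRS can be sketched as a weighted sum of allele dosages taken over all SNPs versus over the SNPs in one gene set. The sketch below is not the authors' pipeline: the function name `polygenic_risk_score`, the toy dosage matrix, the effect sizes, and the gene set mask are all hypothetical, and steps a real analysis would need (SNP selection, linkage disequilibrium handling, score standardization) are omitted because the abstract does not describe them.

```python
import numpy as np


def polygenic_risk_score(dosages, effect_sizes, snp_mask=None):
    """Weighted sum of allele dosages per individual.

    dosages:      (n_individuals, n_snps) counts of the effect allele (0, 1, 2).
    effect_sizes: (n_snps,) per-SNP weights, e.g. GWAS log odds ratios.
    snp_mask:     optional (n_snps,) boolean array; when given, only the
                  selected SNPs contribute (a gene-set-restricted score).
    """
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    if snp_mask is None:
        # No restriction: every SNP contributes (genome-wide score).
        snp_mask = np.ones(effect_sizes.shape[0], dtype=bool)
    snp_mask = np.asarray(snp_mask, dtype=bool)
    return dosages[:, snp_mask] @ effect_sizes[snp_mask]


# Hypothetical toy data: 3 individuals, 4 SNPs.
dosages = np.array([
    [0, 1, 2, 1],
    [2, 0, 1, 0],
    [1, 1, 0, 2],
])
effect_sizes = np.array([0.10, -0.05, 0.20, 0.02])

# Hypothetical gene set containing the first and third SNPs.
gene_set_mask = np.array([True, False, True, False])

prs = polygenic_risk_score(dosages, effect_sizes)
gs_prs = polygenic_risk_score(dosages, effect_sizes, gene_set_mask)

# One row per individual; columns are the candidate model inputs.
features = np.column_stack([prs, gs_prs])
print(features)
```

In this framing, each additional gene set or genetically correlated phenotype would contribute one more column to `features`, and that matrix is what a classifier such as a random forest would take as input.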

