Harnessing genotype and phenotype data for population-scale variant classification using large language models and Bayesian inference.

IF 3.8 | Zone 2 (Biology) | JCR Q2 (Genetics & Heredity)
Toby R Manders, Christopher A Tan, Yuya Kobayashi, Alexander Wahl, Carlos Araya, Alexandre Colavin, Flavia M Facio, Hillery Metz, Jason Reuter, Laure Frésard, Samskruthi R Padigepati, David A Stafford, Robert L Nussbaum, Keith Nykamp
Human Genetics. Journal Article. Published 2025-04-23. DOI: 10.1007/s00439-025-02743-z (https://doi.org/10.1007/s00439-025-02743-z)

Abstract

Variants of Uncertain Significance (VUS) in genetic testing for hereditary diseases burden patients and clinicians, yet clinical data that could reduce VUS are underutilized due to a lack of scalable strategies. We assessed whether a machine learning approach using genotype and phenotype data could improve variant classification and reduce VUS. In this cohort study of a multi-step machine learning approach, patient data from test requisition forms were used to distinguish patients with molecular diagnoses from controls ("patient score"). A generative Bayesian model then used patient scores and variant classifications to infer variant pathogenicity ("variant score"). The study included 3.5 million patients referred for clinical genetic testing across various conditions. Primary outcomes were model- and gene-level discrimination, classification performance, probabilistic calibration, and concordance with orthogonal pathogenicity measures. Integration into a semi-quantitative classification framework was based on posterior pathogenicity probabilities matching PPV ≥ 0.99/NPV ≥ 0.95 thresholds, followed by expert review. We generated 1,334 clinical variant models (CVMs); 595 showed high performance in both machine learning steps (AUROC_patient ≥ 0.8 and AUROC_variant ≥ 0.8) on held-out data. High-confidence predictions from these CVMs provided evidence for 5,362 VUS observed in 200,174 patients, representing 23.4% of all VUS observations in these genes. In 17 frequently tested genes, CVMs reclassified over 1,000 unique VUS, reducing VUS report rates by 9-49% per condition. In conclusion, a scalable machine learning approach using underutilized clinical data improved variant classification and reduced VUS.
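
The abstract does not say how the PPV ≥ 0.99/NPV ≥ 0.95 thresholds or the AUROC ≥ 0.8 filter were computed. The following is only an illustrative sketch, not the authors' implementation. It assumes a held-out set of variants with known labels (1 = pathogenic, 0 = benign) and a model-derived posterior pathogenicity probability for each; the function names `auroc`, `pathogenic_cutoff`, and `benign_cutoff` are hypothetical.

```python
from typing import List, Optional


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC as the chance a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the input size, which is fine
    for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pathogenic_cutoff(probs: List[float], labels: List[int],
                      min_ppv: float = 0.99) -> Optional[float]:
    """Lower the cutoff from the top until PPV first drops below min_ppv.

    Returns the last cutoff c for which variants with prob >= c still had
    PPV >= min_ppv, or None if even the highest score fails.
    """
    best = None
    for c in sorted(set(probs), reverse=True):
        called = [y for p, y in zip(probs, labels) if p >= c]
        if sum(called) / len(called) < min_ppv:
            break
        best = c
    return best


def benign_cutoff(probs: List[float], labels: List[int],
                  min_npv: float = 0.95) -> Optional[float]:
    """Raise the cutoff from the bottom until NPV first drops below min_npv.

    NPV here is the fraction of variants with prob <= c that are benign.
    """
    best = None
    for c in sorted(set(probs)):
        called = [y for p, y in zip(probs, labels) if p <= c]
        if called.count(0) / len(called) < min_npv:
            break
        best = c
    return best


# Toy held-out data (made up for illustration, not from the study).
toy_probs = [0.999, 0.995, 0.98, 0.90, 0.60, 0.40, 0.10, 0.03, 0.01, 0.005]
toy_labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    print("AUROC:", auroc(toy_probs, toy_labels))
    print("pathogenic cutoff:", pathogenic_cutoff(toy_probs, toy_labels))
    print("benign cutoff:", benign_cutoff(toy_probs, toy_labels))
```

In this sketch, variants scoring between the two cutoffs receive no model-based evidence. Stopping at the first failing cutoff keeps each high-confidence region contiguous; that is a design choice of the sketch, not something the abstract specifies.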

Source journal
Human Genetics (Biology: Genetics)
CiteScore: 10.80
Self-citation rate: 3.80%
Articles published: 94
Review time: 1 month
About the journal: Human Genetics is a monthly journal publishing original and timely articles on all aspects of human genetics. The Journal particularly welcomes articles in the areas of Behavioral genetics, Bioinformatics, Cancer genetics and genomics, Cytogenetics, Developmental genetics, Disease association studies, Dysmorphology, ELSI (ethical, legal and social issues), Evolutionary genetics, Gene expression, Gene structure and organization, Genetics of complex diseases and epistatic interactions, Genetic epidemiology, Genome biology, Genome structure and organization, Genotype-phenotype relationships, Human Genomics, Immunogenetics and genomics, Linkage analysis and genetic mapping, Methods in Statistical Genetics, Molecular diagnostics, Mutation detection and analysis, Neurogenetics, Physical mapping and Population Genetics. Articles reporting animal models relevant to human biology or disease are also welcome. Preference will be given to those articles which address clinically relevant questions or which provide new insights into human biology. Unless reporting entirely novel and unusual aspects of a topic, clinical case reports, cytogenetic case reports, papers on descriptive population genetics, articles dealing with the frequency of polymorphisms or additional mutations within genes in which numerous lesions have already been described, and papers that report meta-analyses of previously published datasets will normally not be accepted. The Journal typically will not consider for publication manuscripts that report merely the isolation, map position, structure, and tissue expression profile of a gene of unknown function unless the gene is of particular interest or is a candidate gene involved in a human trait or disorder.