Harnessing genotype and phenotype data for population-scale variant classification using large language models and Bayesian inference.

IF 3.6 · CAS Zone 2 (Biology) · JCR Q2 (Genetics & Heredity)

Human Genetics Pub Date : 2025-06-01 Epub Date: 2025-04-23 DOI:10.1007/s00439-025-02743-z

Toby R Manders, Christopher A Tan, Yuya Kobayashi, Alexander Wahl, Carlos Araya, Alexandre Colavin, Flavia M Facio, Hillery Metz, Jason Reuter, Laure Frésard, Samskruthi R Padigepati, David A Stafford, Robert L Nussbaum, Keith Nykamp

Pages: 605-614 · Publication type: Journal Article · Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12170740/pdf/

Citations: 0

Abstract

Variants of Uncertain Significance (VUS) in genetic testing for hereditary diseases burden patients and clinicians, yet clinical data that could reduce VUS are underutilized due to a lack of scalable strategies. We assessed whether a machine learning approach using genotype and phenotype data could improve variant classification and reduce VUS. In this cohort study of a multi-step machine learning approach, patient data from test requisition forms were used to distinguish patients with molecular diagnoses from controls ("patient score"). A generative Bayesian model then used patient scores and variant classifications to infer variant pathogenicity ("variant score"). The study included 3.5 million patients referred for clinical genetic testing across various conditions. Primary outcomes were model- and gene-level discrimination, classification performance, probabilistic calibration, and concordance with orthogonal pathogenicity measures. Integration into a semi-quantitative classification framework was based on posterior pathogenicity probabilities matching PPV ≥ 0.99/NPV ≥ 0.95 thresholds, followed by expert review. We generated 1,334 clinical variant models (CVMs); 595 showed high performance in both machine learning steps (AUROCpatient ≥ 0.8 and AUROCvariant ≥ 0.8) on held-out data. High-confidence predictions from these CVMs provided evidence for 5,362 VUS observed in 200,174 patients, representing 23.4% of all VUS observations in these genes. In 17 frequently tested genes, CVMs reclassified over 1,000 unique VUS, reducing VUS report rates by 9-49% per condition. In conclusion, a scalable machine learning approach using underutilized clinical data improved variant classification and reduced VUS.
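The second step described in the abstract, inferring a variant's pathogenicity from the patient scores of the people who carry it, can be illustrated with a toy Bayesian update. The sketch below is not the authors' model: the two Beta distributions, the prior of 0.1, the function names, and the posterior cutoffs of 0.99 and 0.05 are all assumptions chosen for illustration (the paper calibrates its thresholds to PPV ≥ 0.99 and NPV ≥ 0.95 and follows them with expert review, which raw cutoffs do not reproduce).

```python
import math


def beta_logpdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def posterior_pathogenic(patient_scores, prior=0.1,
                         pathogenic_shape=(4.0, 2.0),
                         benign_shape=(2.0, 4.0)):
    """Posterior probability that a variant is pathogenic.

    Assumption (hypothetical, for illustration only): carriers of a
    pathogenic variant have patient scores drawn from Beta(4, 2), which
    is skewed toward 1, and carriers of a benign variant have scores
    drawn from Beta(2, 4), which is skewed toward 0. Scores are treated
    as independent given the variant's true class.
    """
    log_path = math.log(prior)
    log_benign = math.log(1.0 - prior)
    for score in patient_scores:
        # Clamp to the open interval so the logarithms stay finite.
        score = min(max(score, 1e-6), 1.0 - 1e-6)
        log_path += beta_logpdf(score, *pathogenic_shape)
        log_benign += beta_logpdf(score, *benign_shape)
    # Normalize in log space to avoid underflow with many carriers.
    top = max(log_path, log_benign)
    w_path = math.exp(log_path - top)
    w_benign = math.exp(log_benign - top)
    return w_path / (w_path + w_benign)


def evidence_label(posterior, upper=0.99, lower=0.05):
    """Map a posterior to an evidence label using illustrative cutoffs."""
    if posterior >= upper:
        return "supports pathogenic"
    if posterior <= lower:
        return "supports benign"
    return "no evidence"


if __name__ == "__main__":
    for scores in ([0.9, 0.9, 0.9], [0.1, 0.1, 0.1], [0.5]):
        p = posterior_pathogenic(scores)
        print(scores, round(p, 6), evidence_label(p))
```

With these assumed distributions, a score of 0.5 is equally likely under both hypotheses and leaves the prior unchanged, while several concordant high or low scores move the posterior past a cutoff. This mirrors the abstract's point that only high-confidence predictions are used as evidence for a VUS.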


Source journal

Human Genetics (Biology: Genetics)

CiteScore: 10.80

Self-citation rate: 3.80%

Articles published: (not listed)

Review time: 1 month

About the journal: Human Genetics is a monthly journal publishing original and timely articles on all aspects of human genetics. The Journal particularly welcomes articles in the areas of Behavioral genetics, Bioinformatics, Cancer genetics and genomics, Cytogenetics, Developmental genetics, Disease association studies, Dysmorphology, ELSI (ethical, legal and social issues), Evolutionary genetics, Gene expression, Gene structure and organization, Genetics of complex diseases and epistatic interactions, Genetic epidemiology, Genome biology, Genome structure and organization, Genotype-phenotype relationships, Human Genomics, Immunogenetics and genomics, Linkage analysis and genetic mapping, Methods in Statistical Genetics, Molecular diagnostics, Mutation detection and analysis, Neurogenetics, Physical mapping and Population Genetics. Articles reporting animal models relevant to human biology or disease are also welcome. Preference will be given to those articles which address clinically relevant questions or which provide new insights into human biology. Unless reporting entirely novel and unusual aspects of a topic, clinical case reports, cytogenetic case reports, papers on descriptive population genetics, articles dealing with the frequency of polymorphisms or additional mutations within genes in which numerous lesions have already been described, and papers that report meta-analyses of previously published datasets will normally not be accepted. The Journal typically will not consider for publication manuscripts that report merely the isolation, map position, structure, and tissue expression profile of a gene of unknown function unless the gene is of particular interest or is a candidate gene involved in a human trait or disorder.