VarPPUD: Pinpointing diagnostic variants from sets of prioritized, strong candidate variants.

IF 3.6, Zone 2 (Biology), JCR Q1 (BIOCHEMICAL RESEARCH METHODS)

PLoS Computational Biology. Pub Date: 2025-09-22; eCollection Date: 2025-09-01; DOI: 10.1371/journal.pcbi.1013414

Rui Yin, Alba Gutiérrez-Sacristán, Shilpa Nadimpalli Kobren, Paul Avillach

PLoS Computational Biology, volume 21(9), e1013414. https://doi.org/10.1371/journal.pcbi.1013414 (PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12468739/pdf/)

Citations: 0

Abstract


VarPPUD: Pinpointing diagnostic variants from sets of prioritized, strong candidate variants.

Rare and ultra-rare genetic conditions are estimated to impact nearly 1 in 17 people worldwide, yet accurately pinpointing the diagnostic variants underlying each of these conditions remains a formidable challenge. Because comprehensive, in vivo functional assessment of all possible genetic variants is infeasible, clinicians instead consider in silico variant pathogenicity predictions to distinguish plausibly disease-causing from benign variants across the genome. However, in the most difficult undiagnosed cases, such as those accepted to the Undiagnosed Diseases Network (UDN), existing pathogenicity predictions cannot reliably discern true etiological variant(s) from other deleterious candidate variants that were prioritized through case- or family-level analyses. Pinpointing the disease-causing variant from a small pool of plausible candidates remains a largely manual effort requiring extensive clinical workups, functional and experimental assays, and eventual identification of genotype- and phenotype-matched individuals. Here, we introduce VarPPUD, a tool trained on prioritized variants from UDN cases, that leverages gene-, amino acid-, and nucleotide-level features to discern pathogenic (disease causative) variants from other damaging or deleterious variants that are unlikely to be confirmed as relevant to the disease. VarPPUD achieves a cross-validated accuracy of 79.3% and precision of 77.5% on a held-out subset of uniquely challenging UDN cases, respectively representing an average 18.6% and 23.4% improvement over nine existing state-of-the-art pathogenicity prediction tools on this task. We validate VarPPUD's ability to discriminate likely from unlikely pathogenic variants using both synthetic data generated via a GAN-based framework and a temporally held-out set of UDN patients evaluated between 2022 and 2024. 
The model was trained exclusively on data available through 2021 and applied without retraining to the post-2021 cohort, demonstrating strong generalizability to newly accrued cases. Finally, we show how VarPPUD can be probed to evaluate each input feature's importance and contribution toward prediction, an essential step toward understanding the distinct characteristics of newly uncovered disease-causing variants.
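To make the evaluation setup in the abstract concrete, here is a minimal sketch of a temporal hold-out with accuracy and precision computed on the later cohort. It is not the VarPPUD code: the `Variant` record, its field names, the 0.5 threshold, and the toy scores are all hypothetical, chosen only to illustrate the idea of fitting on cases through 2021 and scoring later cases without retraining.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    year: int         # year the case was evaluated (hypothetical field)
    score: float      # model score in [0, 1] (hypothetical toy value)
    diagnostic: bool  # curated label: confirmed diagnostic variant or not


def temporal_split(variants, last_train_year=2021):
    """Split records into a training set (<= cutoff year) and a later held-out set."""
    train = [v for v in variants if v.year <= last_train_year]
    held_out = [v for v in variants if v.year > last_train_year]
    return train, held_out


def accuracy_and_precision(variants, threshold=0.5):
    """Accuracy = (TP + TN) / N; precision = TP / (TP + FP)."""
    tp = fp = tn = fn = 0
    for v in variants:
        predicted = v.score >= threshold
        if predicted and v.diagnostic:
            tp += 1
        elif predicted:
            fp += 1
        elif v.diagnostic:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return accuracy, precision


# Toy records only; these are not UDN data and not results from the paper.
toy = [
    Variant(2020, 0.9, True),
    Variant(2021, 0.2, False),
    Variant(2022, 0.8, True),
    Variant(2023, 0.7, False),
    Variant(2023, 0.3, False),
    Variant(2024, 0.4, True),
]
train, held_out = temporal_split(toy)
acc, prec = accuracy_and_precision(held_out)
print(f"held-out n={len(held_out)} accuracy={acc:.2f} precision={prec:.2f}")
```

Splitting by evaluation year rather than at random keeps later cases out of training, which is what lets the held-out cohort stand in for newly accrued cases.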


Source journal

PLoS Computational Biology (BIOCHEMICAL RESEARCH METHODS; MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 7.10

Self-citation rate: 4.70%

Articles published: 820

Review time: 2.5 months

About the journal: PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales—from molecules and cells, to patient populations and ecosystems—through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery. Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines. Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights. Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology. Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.