Rui Yin, Alba Gutiérrez-Sacristán, Shilpa Nadimpalli Kobren, Paul Avillach
PLoS Computational Biology, volume 21, issue 9, e1013414. Published 2025-09-22. DOI: 10.1371/journal.pcbi.1013414 (https://doi.org/10.1371/journal.pcbi.1013414). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12468739/pdf/
VarPPUD: Pinpointing diagnostic variants from sets of prioritized, strong candidate variants.
Rare and ultra-rare genetic conditions are estimated to impact nearly 1 in 17 people worldwide, yet accurately pinpointing the diagnostic variants underlying each of these conditions remains a formidable challenge. Because comprehensive, in vivo functional assessment of all possible genetic variants is infeasible, clinicians instead consider in silico variant pathogenicity predictions to distinguish plausibly disease-causing from benign variants across the genome. However, in the most difficult undiagnosed cases, such as those accepted to the Undiagnosed Diseases Network (UDN), existing pathogenicity predictions cannot reliably discern true etiological variant(s) from other deleterious candidate variants that were prioritized through case- or family-level analyses. Pinpointing the disease-causing variant from a small pool of plausible candidates remains a largely manual effort requiring extensive clinical workups, functional and experimental assays, and eventual identification of genotype- and phenotype-matched individuals. Here, we introduce VarPPUD, a tool trained on prioritized variants from UDN cases, that leverages gene-, amino acid-, and nucleotide-level features to discern pathogenic (disease causative) variants from other damaging or deleterious variants that are unlikely to be confirmed as relevant to the disease. VarPPUD achieves a cross-validated accuracy of 79.3% and precision of 77.5% on a held-out subset of uniquely challenging UDN cases, respectively representing an average 18.6% and 23.4% improvement over nine existing state-of-the-art pathogenicity prediction tools on this task. We validate VarPPUD's ability to discriminate likely from unlikely pathogenic variants using both synthetic data generated via a GAN-based framework and a temporally held-out set of UDN patients evaluated between 2022 and 2024. 
The model was trained exclusively on data available through 2021 and applied without retraining to the post-2021 cohort, demonstrating strong generalizability to newly accrued cases. Finally, we show how VarPPUD can be probed to evaluate each input feature's importance and contribution toward prediction, an essential step toward understanding the distinct characteristics of newly uncovered disease-causing variants.
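The temporal hold-out evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `Variant` fields, the toy records, and the 0.5 decision threshold are all assumptions made for illustration. It only shows the general pattern of splitting cases at a cutoff year and computing accuracy and precision on the later cohort.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    year: int         # year the case was evaluated (hypothetical field)
    score: float      # predicted probability of being diagnostic (hypothetical)
    diagnostic: bool  # curated label: confirmed disease-causing or not


def temporal_split(variants, cutoff_year=2021):
    """Return (train, held_out): cases through the cutoff year vs. after it."""
    train = [v for v in variants if v.year <= cutoff_year]
    held_out = [v for v in variants if v.year > cutoff_year]
    return train, held_out


def accuracy_and_precision(variants, threshold=0.5):
    """Compute accuracy and precision for scores thresholded at `threshold`."""
    tp = fp = tn = fn = 0
    for v in variants:
        predicted = v.score >= threshold
        if predicted and v.diagnostic:
            tp += 1
        elif predicted and not v.diagnostic:
            fp += 1
        elif not predicted and v.diagnostic:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return accuracy, precision


# Toy records, invented purely to exercise the functions above.
variants = [
    Variant(2019, 0.9, True),
    Variant(2020, 0.2, False),
    Variant(2021, 0.7, False),
    Variant(2022, 0.8, True),
    Variant(2023, 0.6, False),
    Variant(2023, 0.3, False),
    Variant(2024, 0.4, True),
]

train, held_out = temporal_split(variants)
acc, prec = accuracy_and_precision(held_out)
print(f"held-out n={len(held_out)} accuracy={acc:.2f} precision={prec:.2f}")
```

Splitting by evaluation year rather than at random keeps every later case out of training, which is what lets the held-out metrics speak to performance on newly accrued cases.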
About the journal:
PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales—from molecules and cells, to patient populations and ecosystems—through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery.
Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines.
Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights.
Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology.
Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise, to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.