varCADD: large sets of standing genetic variation enable genome-wide pathogenicity prediction.

IF 10.4 · Zone 1 (Biology) · JCR Q1 (Genetics & Heredity)

Genome Medicine Pub Date : 2025-08-04 DOI:10.1186/s13073-025-01517-6

Lusiné Nazaretyan, Philipp Rentzsch, Martin Kircher

Genome Medicine 17(1), 84 · Journal Article
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12323237/pdf/ · https://doi.org/10.1186/s13073-025-01517-6


Abstract

Background: Machine learning and artificial intelligence are increasingly being applied to identify phenotypically causal genetic variation. These data-driven methods require comprehensive training sets to deliver reliable results. However, large unbiased datasets for variant prioritization and effect prediction are rare, as most of the available databases do not represent a broad ensemble of variant effects and are often biased towards the protein-coding genome, or even towards a few well-studied genes.

Methods: To overcome these issues, we propose several alternative training sets derived from subsets of human standing variation. Specifically, we use variants identified from whole-genome sequences of 71,156 individuals contained in gnomAD v3.0 and approximate the benign set with frequent standing variation and the deleterious set with rare or singleton variation. We apply the Combined Annotation Dependent Depletion framework (CADD) and train several alternative models using CADD v1.6.
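The labeling idea in the Methods can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the abstract does not state the frequency cutoffs, so the `COMMON_AF` threshold, the `Variant` record, and the `proxy_label` function are all hypothetical names and values chosen for the example. The sketch treats singletons (allele count of 1) as proxy-deleterious, frequent variants as proxy-benign, and leaves everything in between unlabeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record with gnomAD-style counts."""
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_count: int   # number of observed alternate alleles (AC)
    allele_number: int  # number of called alleles at the site (AN)


# Assumed cutoff for "frequent" variation; the abstract gives no value.
COMMON_AF = 0.01


def proxy_label(v: Variant, common_af: float = COMMON_AF) -> Optional[int]:
    """Return 1 (proxy-deleterious), 0 (proxy-benign), or None (unlabeled)."""
    if v.allele_number <= 0 or v.allele_count <= 0:
        return None  # no usable frequency information
    if v.allele_count == 1:
        return 1     # singleton: observed once in the cohort
    if v.allele_count / v.allele_number >= common_af:
        return 0     # frequent standing variation
    return None      # intermediate frequency: left out in this sketch


# 71,156 diploid genomes give at most 142,312 alleles per autosomal site.
examples = [
    Variant("1", 100, "A", "G", 1, 142312),
    Variant("1", 200, "C", "T", 5000, 142312),
    Variant("1", 300, "G", "A", 5, 142312),
]
labels = [proxy_label(v) for v in examples]
```

Keeping the threshold as a parameter makes it straightforward to build several alternative training sets from the same variant calls, which is the kind of comparison the Methods describe.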

Results: Using the NCBI ClinVar validation set, we demonstrate that the alternative models have state-of-the-art accuracy, globally on par with deleteriousness scores of CADD v1.6 and v1.7, but also outperforming them in certain genomic regions. Being larger than conventional training datasets, including the evolutionary-derived training dataset of about 30 million variants in CADD, standing variation datasets cover a broader range of genomic regions and rare instances of the applied annotations. For example, they cover more recent evolutionary changes common in gene regulatory regions, which are more challenging to assess with conventional tools.
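A comparison of this kind can be illustrated with a small evaluation sketch. The abstract does not name the accuracy metric, so the use of ROC AUC here is an assumption, and the function names, region names, and data are made up for the example. The code computes AUC as the fraction of (pathogenic, benign) pairs in which the pathogenic variant receives the higher score, and then repeats that per genomic region.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_region(
    records: Iterable[Tuple[str, int, float]]
) -> Dict[str, float]:
    """Group (region, label, score) records and compute one AUC per region."""
    grouped: Dict[str, Tuple[List[int], List[float]]] = defaultdict(
        lambda: ([], [])
    )
    for region, label, score in records:
        grouped[region][0].append(label)
        grouped[region][1].append(score)
    return {r: roc_auc(ls, ss) for r, (ls, ss) in grouped.items()}


# Made-up records: (region, 1 = pathogenic / 0 = benign, model score).
toy = [
    ("coding", 1, 0.9), ("coding", 1, 0.8),
    ("coding", 0, 0.3), ("coding", 0, 0.1),
    ("regulatory", 1, 0.9), ("regulatory", 0, 0.8),
    ("regulatory", 1, 0.3), ("regulatory", 0, 0.1),
]
per_region = auc_by_region(toy)
```

The pairwise loop is quadratic and only suitable for small validation sets; a rank-based implementation would be the usual choice at ClinVar scale.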

Conclusions: Standing variation allows us to directly train state-of-the-art models for genome-wide variant prioritization or to augment evolutionary-derived variants in training. The proposed datasets have several advantages, like being substantially larger and potentially less biased. Datasets derived from standing variation represent natural allelic changes in the human genome and do not require extensive simulations and adaptations to annotations of evolutionary-derived sequence alterations used for CADD training. We provide datasets as well as trained models to the community for further development and application.


Source journal: Genome Medicine (Genetics & Heredity)

CiteScore: 20.80 · Self-citation rate: 0.80% · Articles published: 128 · Review time: 6-12 weeks

About the journal: Genome Medicine is an open access journal that publishes outstanding research applying genetics, genomics, and multi-omics to understand, diagnose, and treat disease. Bridging basic science and clinical research, it covers areas such as cancer genomics, immuno-oncology, immunogenomics, infectious disease, microbiome, neurogenomics, systems medicine, clinical genomics, gene therapies, precision medicine, and clinical trials. The journal publishes original research, methods, software, and reviews to serve authors and promote broad interest and importance in the field.