克服限制,自定义DeepVariant家养动物与TrioTrain

IF 5.5 2区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY
Jenna Kalleberg, Jacob Rissman, Robert D Schnabel
{"title":"克服限制,自定义DeepVariant家养动物与TrioTrain","authors":"Jenna Kalleberg, Jacob Rissman, Robert D Schnabel","doi":"10.1101/gr.279542.124","DOIUrl":null,"url":null,"abstract":"Generating high-quality variant callsets across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a \"universal\" algorithm has magnified the unknown impacts when used with non-human species. We use bovine genomes to assess the limits of using human-genome-trained variant callers, including the allele frequency channel (DV-AF) and joint-caller DeepTrio (DT). Our novel approach, TrioTrain, automates extending DV for diploid species lacking Genome-in-a-Bottle (GIAB) resources, using a region shuffling approach to mitigate barriers for SLURM-based clusters. Imperfect animal truth labels are curated to remove Mendelian discordant sites before training DV to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to create the first multi-species-trained DV-AF checkpoint. Although incomplete bovine truth sets constrain recall within challenging repetitive regions, we observe a mean SNV F1 score >0.990 across new checkpoints during GIAB benchmarking. With HG002, a bovine-trained checkpoint (28) decreased the Mendelian Inheritance Error (MIE) rate by a factor of two compared to the default (DV). Checkpoint 28 has a mean MIE rate of 0.03 percent in three bovine interspecies cross genomes. These results illustrate that a multi-species, trio-based training strategy reduces inheritance errors during single-sample variant calling. While exclusively training with human genomes deters transferring deep-learning-based variant calling to new species, we use the diverse ancestry within bovids to illustrate the need for advanced tools designed for comparative genomics.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"64 1","pages":""},"PeriodicalIF":5.5000,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Overcoming limitations to customize DeepVariant for domesticated animals with TrioTrain\",\"authors\":\"Jenna Kalleberg, Jacob Rissman, Robert D Schnabel\",\"doi\":\"10.1101/gr.279542.124\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Generating high-quality variant callsets across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a \\\"universal\\\" algorithm has magnified the unknown impacts when used with non-human species. We use bovine genomes to assess the limits of using human-genome-trained variant callers, including the allele frequency channel (DV-AF) and joint-caller DeepTrio (DT). Our novel approach, TrioTrain, automates extending DV for diploid species lacking Genome-in-a-Bottle (GIAB) resources, using a region shuffling approach to mitigate barriers for SLURM-based clusters. Imperfect animal truth labels are curated to remove Mendelian discordant sites before training DV to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to create the first multi-species-trained DV-AF checkpoint. Although incomplete bovine truth sets constrain recall within challenging repetitive regions, we observe a mean SNV F1 score >0.990 across new checkpoints during GIAB benchmarking. With HG002, a bovine-trained checkpoint (28) decreased the Mendelian Inheritance Error (MIE) rate by a factor of two compared to the default (DV). Checkpoint 28 has a mean MIE rate of 0.03 percent in three bovine interspecies cross genomes. These results illustrate that a multi-species, trio-based training strategy reduces inheritance errors during single-sample variant calling. While exclusively training with human genomes deters transferring deep-learning-based variant calling to new species, we use the diverse ancestry within bovids to illustrate the need for advanced tools designed for comparative genomics.\",\"PeriodicalId\":12678,\"journal\":{\"name\":\"Genome research\",\"volume\":\"64 1\",\"pages\":\"\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2025-06-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Genome research\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1101/gr.279542.124\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"BIOCHEMISTRY & MOLECULAR BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genome research","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1101/gr.279542.124","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
引用次数: 0

摘要

由于大多数生物信息学工具默认基于人类基因组的假设,因此在不同物种中生成高质量的变体呼叫集仍然具有挑战性。DeepVariant (DV)在没有联合基因分型的情况下表现出色,同时提供了更少的实施障碍。然而,对“通用”算法日益增长的吸引力,放大了对非人类物种的未知影响。我们使用牛基因组来评估使用人类基因组训练的变异呼叫者的局限性,包括等位基因频率通道(DV-AF)和联合呼叫者DeepTrio (DT)。我们的新方法TrioTrain使用区域洗牌方法来减轻基于slurm的集群的障碍,自动扩展缺乏瓶中基因组(GIAB)资源的二倍体物种的DV。在训练DV正确地对后代进行基因型之前,对不完美的动物真相标签进行了整理,以去除孟德尔不一致位点。使用TrioTrain,我们使用牛,牦牛和野牛三重奏来创建第一个多物种训练的DV-AF检查点。虽然不完整的牛真值集限制了具有挑战性的重复区域的召回,但我们在GIAB基准测试期间观察到新检查点的平均SNV F1得分>;0.990。在HG002中,经过牛训练的检查点(28)将孟德尔遗传错误率(MIE)与默认(DV)相比降低了两倍。在三个牛种间杂交基因组中,检查点28的平均MIE率为0.03%。这些结果表明,多物种、基于三元组的训练策略减少了单样本变异调用过程中的遗传错误。虽然专门使用人类基因组进行训练会阻止将基于深度学习的变异召唤转移到新物种,但我们使用动物内部的多样化祖先来说明为比较基因组学设计的先进工具的必要性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Overcoming limitations to customize DeepVariant for domesticated animals with TrioTrain
Generating high-quality variant callsets across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a "universal" algorithm has magnified the unknown impacts when used with non-human species. We use bovine genomes to assess the limits of using human-genome-trained variant callers, including the allele frequency channel (DV-AF) and joint-caller DeepTrio (DT). Our novel approach, TrioTrain, automates extending DV for diploid species lacking Genome-in-a-Bottle (GIAB) resources, using a region shuffling approach to mitigate barriers for SLURM-based clusters. Imperfect animal truth labels are curated to remove Mendelian discordant sites before training DV to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to create the first multi-species-trained DV-AF checkpoint. Although incomplete bovine truth sets constrain recall within challenging repetitive regions, we observe a mean SNV F1 score >0.990 across new checkpoints during GIAB benchmarking. With HG002, a bovine-trained checkpoint (28) decreased the Mendelian Inheritance Error (MIE) rate by a factor of two compared to the default (DV). Checkpoint 28 has a mean MIE rate of 0.03 percent in three bovine interspecies cross genomes. These results illustrate that a multi-species, trio-based training strategy reduces inheritance errors during single-sample variant calling. While exclusively training with human genomes deters transferring deep-learning-based variant calling to new species, we use the diverse ancestry within bovids to illustrate the need for advanced tools designed for comparative genomics.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Genome research
Genome research 生物-生化与分子生物学
CiteScore
12.40
自引率
1.40%
发文量
140
审稿时长
6 months
期刊介绍: Launched in 1995, Genome Research is an international, continuously published, peer-reviewed journal that focuses on research that provides novel insights into the genome biology of all organisms, including advances in genomic medicine. Among the topics considered by the journal are genome structure and function, comparative genomics, molecular evolution, genome-scale quantitative and population genetics, proteomics, epigenomics, and systems biology. The journal also features exciting gene discoveries and reports of cutting-edge computational biology and high-throughput methodologies. New data in these areas are published as research papers, or methods and resource reports that provide novel information on technologies or tools that will be of interest to a broad readership. Complete data sets are presented electronically on the journal''s web site where appropriate. The journal also provides Reviews, Perspectives, and Insight/Outlook articles, which present commentary on the latest advances published both here and elsewhere, placing such progress in its broader biological context.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信