wholeskim: Utilising Genome Skims for Taxonomically Annotating Ancient DNA Metagenomes.

Impact factor 5.5; Zone 1 (Biology); JCR Q1, Biochemistry & Molecular Biology

Molecular Ecology Resources, article e70001; Pub Date: 2025-07-21; DOI: 10.1111/1755-0998.70001

Lucas Elliott, Frédéric Boyer, Teo Lemane, Inger Greve Alsos, Eric Coissac


Citations: 0

Abstract

Inferring community composition from shotgun sequencing of environmental DNA depends heavily on the completeness of the reference database used to assign taxonomic information and on the pipeline used. While the number of complete, fully assembled reference genomes is increasing rapidly, their taxonomic coverage is generally too sparse to build reference databases that span all or most of the target taxa. In the interim, low-coverage whole-genome sequencing, or genome skimming, provides a cost-effective and scalable alternative source of genome-wide information. Because a skim lacks the coverage needed to assemble large contigs of nuclear DNA, much of its utility for taxonomic annotation lies in its short-read form. However, previous methods have not been able to fully leverage the data in this format. We demonstrate the utility of wholeskim, a pipeline for indexing the k-mers present in genome skims and subsequently querying these indices with short DNA reads. Wholeskim expands on the functionality of kmindex, software that utilises Bloom filters to efficiently index and query billions of k-mers. On a collection of thousands of plant genome skims, wholeskim is the only software able to index and query the skims in their unassembled form. It correctly annotates 1.16× more simulated reads and 2.48× more true sedimentary ancient DNA (sedaDNA) reads in 0.32× the time required by Holi, another metagenomic pipeline, which uses genome skims in their assembled form as its reference database input. We also explore the effects of taxonomic and genomic completeness of the reference database on the accuracy and sensitivity of read assignment. Increasing the genomic coverage of the genome skims used as reference increases the number of correctly annotated reads, but with diminishing returns beyond ~1× depth of coverage. Increasing taxonomic coverage clearly reduces the number of false negative taxa in the dataset, but we also demonstrate that it does not greatly affect false positive annotations.




Source journal

Molecular Ecology Resources (Biology: Evolutionary Biology)

CiteScore: 15.60

Self-citation rate: 5.20%

Articles published: 170

Review time: 3 months

Journal description: Molecular Ecology Resources promotes the creation of comprehensive resources for the scientific community, encompassing computer programs, statistical and molecular advancements, and a diverse array of molecular tools. Serving as a conduit for disseminating these resources, the journal targets a broad audience of researchers in the fields of evolution, ecology, and conservation. Articles in Molecular Ecology Resources are crafted to support investigations tackling significant questions within these disciplines. In addition to original resource articles, Molecular Ecology Resources features Reviews, Opinions, and Comments relevant to the field. The journal also periodically releases Special Issues focusing on resource development within specific areas.