wholeskim: Utilising Genome Skims for Taxonomically Annotating Ancient DNA Metagenomes
Lucas Elliott, Frédéric Boyer, Teo Lemane, Inger Greve Alsos, Eric Coissac
Molecular Ecology Resources, e70001. Journal Article. Published 2025-07-21. DOI: 10.1111/1755-0998.70001 (https://doi.org/10.1111/1755-0998.70001)
Impact factor: 5.5. JCR: Q1 (Biochemistry & Molecular Biology).
Citation count: 0
Abstract
Inferring community composition from shotgun sequencing of environmental DNA is highly dependent on the completeness of reference databases used to assign taxonomic information as well as the pipeline used. While the number of complete, fully assembled reference genomes is increasing rapidly, their taxonomic coverage is generally too sparse to use them to build complete reference databases that span all or most of the target taxa. Low-coverage, whole genome sequencing, or skimming, provides a cost-effective and scalable alternative source of genome-wide information in the interim. Without enough coverage to assemble large contigs of nuclear DNA, much of the utility of a genome skim in the context of taxonomic annotation is found in its short read form. However, previous methods have not been able to fully leverage the data in this format. We demonstrate the utility of wholeskim, a pipeline for the indexing of k-mers present in genome skims and subsequent querying of these indices with short DNA reads. Wholeskim expands on the functionality of kmindex, a software which utilises Bloom filters to efficiently index and query billions of k-mers. Using a collection of thousands of plant genome skims, wholeskim is the only software that is able to index and query the skims in their unassembled form. It is able to correctly annotate 1.16× more simulated reads and 2.48× more true sedaDNA reads in 0.32× of the time required by Holi, another metagenomic pipeline that uses genome skims in their assembled form as its reference database input. We also explore the effects of taxonomic and genomic completeness of the reference database on the accuracy and sensitivity of read assignment. Increasing the genomic coverage of the genome skims used as reference increases the number of correctly annotated reads, but with diminishing returns after ~1× depth of coverage. 
Increasing taxonomic coverage clearly reduces the number of false negative taxa in the dataset, but we also demonstrate that it does not greatly impact false positive annotations.
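The indexing-and-querying idea described in the abstract can be illustrated with a toy example. The sketch below is not wholeskim's or kmindex's actual implementation; it only shows, under simplifying assumptions, how k-mers from unassembled reference reads could be stored in a Bloom filter and how a short query read could then be scored by the fraction of its k-mers found in that filter. All names (`BloomFilter`, `index_skim`, `shared_fraction`) and parameters (filter size, number of hashes, k = 5) are hypothetical, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative only)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def index_skim(reads, k):
    """Insert all k-mers of the (unassembled) reference reads into one filter."""
    bf = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bf.add(kmer)
    return bf


def shared_fraction(query, bf, k):
    """Fraction of the query read's k-mers that the filter reports as present."""
    ks = list(kmers(query, k))
    if not ks:
        return 0.0
    return sum(km in bf for km in ks) / len(ks)


# Hypothetical reference reads standing in for one genome skim.
skim_reads = ["ACGTACGTTGCA", "TTGCAGGCTAAC"]
index = index_skim(skim_reads, k=5)

# The query is a substring of the first read, so every k-mer was indexed;
# Bloom filters never give false negatives, hence the fraction is 1.0.
print(shared_fraction("CGTACGTTG", index, k=5))
```

In a setting like the one the abstract describes, one such filter per reference taxon would let a read be compared against many skims, with the best-scoring taxa informing its annotation; how wholeskim actually combines and thresholds these scores is not specified here.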
About the journal:
Molecular Ecology Resources promotes the creation of comprehensive resources for the scientific community, encompassing computer programs, statistical and molecular advancements, and a diverse array of molecular tools. Serving as a conduit for disseminating these resources, the journal targets a broad audience of researchers in the fields of evolution, ecology, and conservation. Articles in Molecular Ecology Resources are crafted to support investigations tackling significant questions within these disciplines.
In addition to original resource articles, Molecular Ecology Resources features Reviews, Opinions, and Comments relevant to the field. The journal also periodically releases Special Issues focusing on resource development within specific areas.