Persistent Gaps and Errors in Reference Databases Impede Ecologically Meaningful Taxonomy Assignments in 18S rRNA Studies: A Case Study of Terrestrial and Marine Nematodes

Q1 Agricultural and Biological Sciences
Alejandro De Santiago, Tiago José Pereira, Timothy John Ferrero, Natalie Barnes, Delphine Lallias, Simon Creer, Holly M. Bik
Environmental DNA, volume 7, issue 2. Published 2025-03-25. DOI: 10.1002/edn3.70080
https://onlinelibrary.wiley.com/doi/10.1002/edn3.70080

Abstract

In metabarcoding studies, Linnaean taxonomy assignments of Operational Taxonomic Units (OTUs) or Amplicon Sequence Variants (ASVs) underpin many downstream bioinformatics analyses and ecological interpretations of environmental DNA (eDNA) datasets. However, public molecular databases (e.g., SILVA, EUKARYOME, BOLD) for most microbial metazoan phyla (nematodes, tardigrades, kinorhynchs, etc.) are sparsely populated, negatively impacting our ability to assign ecologically meaningful taxonomy to these understudied groups. Additionally, the choice of bioinformatics parameters and computational algorithms can further affect the accuracy of eDNA taxonomy assignments. Here, we use two in silico datasets to show that taxonomy assignments using the 18S rRNA gene can be dramatically improved by curating Linnaean taxonomy strings associated with each reference sequence and closing phylogenetic gaps by improving taxon sampling. Using free-living nematodes as a case study, we applied two commonly used taxonomy assignment algorithms (BLAST+ and the QIIME2 Naïve Bayes classifier) across six iterations of the SILVA 138 reference database to evaluate the precision and accuracy of taxonomy assignments. The BLAST+ top hit with a 90% sequence similarity cutoff often returned the highest percentage of correctly assigned taxonomy at the genus level, and the QIIME2 Naïve Bayes classifier performed similarly well when paired with a reference database containing corrected taxonomy strings. Our results highlight the urgent need for phylogenetically informed expansions of public reference databases (encompassing both genomes and common gene markers), focused on poorly sampled lineages that are now robustly recovered via eDNA metabarcoding approaches. Additional taxonomy curation efforts should be applied to popular reference databases such as SILVA, and taxon sampling could be rapidly improved by more frequent incorporation of newly published GenBank sequences linked to genus- and/or species-level identifications.
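
As an illustration of the "BLAST+ top hit with a 90% sequence similarity cutoff" rule mentioned above, the following is a minimal sketch and not the authors' pipeline. It assumes BLAST+ results were written in the standard 12-column tabular format (`-outfmt 6`), treats the top hit as the highest-bitscore hit per query, and leaves a query unassigned when that hit falls below the cutoff. The function name `top_hits` and the parameter `min_identity` are hypothetical.

```python
def top_hits(lines, min_identity=90.0):
    """Pick the highest-bitscore hit per query, then apply an identity cutoff.

    `lines` is any iterable of tab-separated BLAST+ outfmt 6 rows:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Returns {qseqid: (sseqid, pident, bitscore)}; queries whose top hit is
    below `min_identity` are omitted (i.e., left unassigned).
    """
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comment lines and anything that is not a full 12-column row.
        if len(fields) < 12 or fields[0].startswith("#"):
            continue
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        # Keep only the best-scoring hit seen so far for this query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    # Apply the similarity cutoff to the top hit only (assumed interpretation).
    return {q: hit for q, hit in best.items() if hit[1] >= min_identity}
```

The subject identifiers returned here would still need to be mapped to the taxonomy strings of the chosen reference database; as the abstract notes, the quality of those strings is what largely determines whether the resulting genus-level assignment is correct.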


Source journal: Environmental DNA
Subject area: Agricultural and Biological Sciences - Ecology, Evolution, Behavior and Systematics
CiteScore: 11.00
Self-citation rate: 0.00%
Articles published: 99
Review time: 16 weeks