Automating the Curation of DNA Barcode Databases for Vascular Plants

Impact factor (IF) 6.2, JCR Q1, Agricultural and Biological Sciences
Andreas Kolter, Paul Hebert
Environmental DNA, vol. 7, no. 3. Journal article, published 2025-06-26.
DOI: 10.1002/edn3.70125
https://onlinelibrary.wiley.com/doi/10.1002/edn3.70125
Citations: 0

Automating the Curation of DNA Barcode Databases for Vascular Plants

Comprehensive, curated, and current DNA barcode reference databases are essential for both the identification of single specimens and for the interpretation of metabarcoding data. In the case of plants, nuclear (ITS) and plastid (rbcL, matK) markers are commonly used together. Because the plastid regions are segments of protein-coding genes, their alignment and analysis are usually straightforward. By contrast, the assembly and validation of ITS records is considerably more difficult for two reasons: the prevalence of indels and intraindividual sequence variation. This complexity has provoked the development of several workflows to support the curation of reference databases for the internal transcribed spacer (ITS) region for plant barcoding. However, the pipelines used to create these databases lack functionalities which are essential to ensure a solid post-analytical validation. This paper presents a new workflow to address these shortcomings, with the goal of enhancing the reliability and accuracy of plant barcoding studies. We furthermore demonstrate that clustering of reference databases results in a substantial drop in the fraction of queries that gain a correct species-level assignment. By contrast, setting an acceptance threshold for identifications, based on the distance between query and match, leads to a meaningful reduction of error rates in incomplete reference databases.
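
The acceptance-threshold idea in the last sentence of the abstract can be illustrated with a minimal sketch. It is not the authors' workflow: the use of an uncorrected p-distance on pre-aligned sequences, the cut-off of 0.01, and the names `p_distance`, `best_match`, and `assign_species` are all assumptions chosen for illustration. The sketch only shows the general pattern of rejecting an identification when the closest reference is too distant from the query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    species: str     # species label of the closest reference record
    distance: float  # p-distance between query and that record


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def best_match(query: str, references: list[tuple[str, str]]) -> Match:
    """Return the reference record with the smallest p-distance to the query."""
    if not references:
        raise ValueError("reference database is empty")
    species, seq = min(references, key=lambda rec: p_distance(query, rec[1]))
    return Match(species, p_distance(query, seq))


def assign_species(
    query: str,
    references: list[tuple[str, str]],
    max_distance: float = 0.01,  # hypothetical cut-off, not taken from the paper
) -> Optional[str]:
    """Accept the closest species only if it lies within max_distance."""
    hit = best_match(query, references)
    return hit.species if hit.distance <= max_distance else None


# Toy reference database of (species, aligned sequence) pairs.
REFERENCES = [
    ("Species A", "ACGTACGTAC"),
    ("Species B", "ACGTTCGTAA"),
]
```

With this toy database, `assign_species("ACGTACGTAC", REFERENCES)` accepts the exact match to Species A, while a query that is far from every reference returns `None` instead of a forced species name. That rejection is the behaviour that matters when the reference database is incomplete.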

Source journal: Environmental DNA
Subject area: Agricultural and Biological Sciences (Ecology, Evolution, Behavior and Systematics)
CiteScore: 11.00
Self-citation rate: 0.00%
Articles published: 99
Review time: 16 weeks