Automating the Curation of DNA Barcode Databases for Vascular Plants

IF 6.2 Q1 Agricultural and Biological Sciences

Environmental DNA Pub Date : 2025-06-26 DOI:10.1002/edn3.70125

Andreas Kolter, Paul Hebert


Citations: 0

Abstract


Automating the Curation of DNA Barcode Databases for Vascular Plants


Comprehensive, curated, and current DNA barcode reference databases are essential for both the identification of single specimens and for the interpretation of metabarcoding data. In the case of plants, nuclear (ITS) and plastid (rbcL, matK) markers are commonly used together. Because the plastid regions are segments of protein-coding genes, their alignment and analysis are usually straightforward. By contrast, the assembly and validation of ITS records is considerably more difficult for two reasons: the prevalence of indels and intraindividual sequence variation. This complexity has provoked the development of several workflows to support the curation of reference databases for the internal transcribed spacer (ITS) region for plant barcoding. However, the pipelines used to create these databases lack functionalities which are essential to ensure a solid post-analytical validation. This paper presents a new workflow to address these shortcomings, with the goal of enhancing the reliability and accuracy of plant barcoding studies. We furthermore demonstrate that clustering of reference databases results in a substantial drop in the fraction of queries that gain a correct species-level assignment. By contrast, setting an acceptance threshold for identifications, based on the distance between query and match, leads to a meaningful reduction of error rates in incomplete reference databases.
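The acceptance-threshold idea in the last sentence can be illustrated with a minimal sketch. This is not the authors' workflow: it assumes the query and reference sequences are already aligned to the same length, uses a plain uncorrected p-distance, and the names `p_distance`, `assign_species`, and `max_distance` are hypothetical. The threshold value is a placeholder, not a figure from the paper.

```python
from typing import Dict, Optional


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two pre-aligned sequences.

    Positions where either sequence has a gap ('-') or an 'N' are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return diffs / compared


def assign_species(
    query: str, references: Dict[str, str], max_distance: float
) -> Optional[str]:
    """Return the closest reference species, or None if it is too distant.

    `max_distance` is the (assumed) acceptance threshold: a best match
    farther away than this is rejected rather than reported.
    """
    best_name: Optional[str] = None
    best_dist = float("inf")
    for name, seq in references.items():
        d = p_distance(query, seq)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= max_distance else None


if __name__ == "__main__":
    refs = {"Species A": "ACGTACGTAC", "Species B": "ACGTTCGTAA"}
    print(assign_species("ACGTACGTAC", refs, max_distance=0.03))
    print(assign_species("TTTTACGTAC", refs, max_distance=0.03))
```

Returning `None` for a distant best match is the point of the sketch: when the true species is missing from an incomplete reference database, rejecting the assignment avoids reporting the nearest wrong species.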


Source journal

Environmental DNA Agricultural and Biological Sciences-Ecology, Evolution, Behavior and Systematics

CiteScore

11.00

Self-citation rate

0.00%

Articles published

Review time

16 weeks