Impact of database choice and confidence score on the performance of taxonomic classification using Kraken2

IF 5 · Zone 4 (Agricultural and Forestry Sciences) · JCR Q1 (Biotechnology & Applied Microbiology)

aBIOTECH · Pub Date: 2024-07-31 · DOI: 10.1007/s42994-024-00178-0

Yunlong Liu, Morteza H. Ghaffari, Tao Ma, Yan Tu

aBIOTECH, volume 5, issue 4, pages 465-475. Article page: https://link.springer.com/article/10.1007/s42994-024-00178-0

Citations: 0

Abstract

Accurate taxonomic classification is essential to understanding microbial diversity and function through metagenomic sequencing. However, this task is complicated by the vast variety of microbial genomes and the computational limitations of bioinformatics tools. This study evaluated the impact of reference database selection and confidence score (CS) settings on the performance of Kraken2, a widely used k-mer-based metagenomic classifier. We generated simulated metagenomic datasets to systematically evaluate how the choice of reference database, from the compact Minikraken v1 to the expansive nt and GTDB r202, and the CS setting (from 0 to 1.0) affect the key performance metrics of Kraken2: classification rate, precision, recall, F1 score, and the agreement between true and estimated bacterial abundance. Our results show that a higher CS, which makes taxonomic classification more stringent by requiring greater k-mer agreement, generally decreases the classification rate. This effect is particularly pronounced for smaller databases such as Minikraken and Standard-16, where no reads could be classified when the CS was above 0.4. In contrast, for larger databases such as Standard, nt, and GTDB r202, precision and F1 scores improved significantly with increasing CS, highlighting their robustness to stringent conditions. Recovery rates were mostly stable, indicating consistent detection of species under different CS settings. Crucially, the results show that a comprehensive reference database combined with a moderate CS (0.2 or 0.4) significantly improves classification accuracy and sensitivity. This finding underscores the need to select database and CS parameters carefully, tailoring them to the specific scientific question and the available computational resources, in order to optimize the results of metagenomic analyses.
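To make the role of the confidence score more concrete, the following is a minimal sketch and not the authors' pipeline or Kraken2's actual code. It assumes the rule described in the Kraken2 manual: a read's label is kept only if the fraction of its k-mers that fall within the clade rooted at that label meets the threshold; otherwise the label moves to the parent taxon, and the read is left unclassified if no ancestor qualifies. The toy taxonomy, the k-mer counts, and the function names `apply_confidence` and `precision_recall_f1` are hypothetical, and the metrics use the standard set-based definitions, which may differ in detail from those used in the paper.

```python
# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "Salmonella": "Bacteria",
}


def in_clade(taxon, clade_root):
    """Return True if `taxon` is `clade_root` or one of its descendants."""
    while taxon is not None:
        if taxon == clade_root:
            return True
        taxon = PARENT[taxon]
    return False


def apply_confidence(label, kmer_hits, threshold):
    """Move `label` up the tree until its clade support meets `threshold`.

    `kmer_hits` maps taxon -> number of k-mers assigned to it; the key None
    counts k-mers with no database hit. Returns None for "unclassified".
    """
    total = sum(kmer_hits.values())
    if total == 0:
        return None
    while label is not None:
        # k-mers assigned anywhere inside the clade rooted at `label`
        support = sum(n for t, n in kmer_hits.items() if in_clade(t, label))
        if support / total >= threshold:
            return label
        label = PARENT[label]
    return None


def precision_recall_f1(predicted, truth):
    """Set-based precision, recall, and F1 over detected taxa (assumed definitions)."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# One read with 10 k-mers: 3 hit E. coli, 2 the genus, 1 the domain, 4 nothing.
hits = {"E. coli": 3, "Escherichia": 2, "Bacteria": 1, None: 4}
for cs in (0.0, 0.2, 0.4, 0.8):
    print(cs, apply_confidence("E. coli", hits, cs))
# 0.0 and 0.2 keep "E. coli"; 0.4 falls back to "Escherichia"; 0.8 gives None.
```

In this toy case, raising the threshold first pushes the read to a higher rank and then leaves it unclassified, which is consistent with the reported drop in classification rate at higher CS. A smaller database would leave more k-mers without a hit, so the same read would fail the threshold sooner.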




Source journal

aBIOTECH

CiteScore

7.70

Self-citation rate

2.80%

Articles published