Precise and scalable metagenomic profiling with sample-tailored minimizer libraries.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics. Pub Date: 2025-06-09. eCollection Date: 2025-06-01. DOI: 10.1093/nargab/lqaf076

Johan Nyström-Persson, Nishad Bapatdhar, Samik Ghosh

Volume 7, issue 2, article lqaf076. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12147018/pdf/

Citations: 0

Abstract


Precise and scalable metagenomic profiling with sample-tailored minimizer libraries.

Reference-based metagenomic profiling requires large genome libraries to maximize detection and minimize false positives. However, as libraries grow, classification accuracy suffers, particularly in k-mer-based tools, as the growing overlap in genomic regions among organisms results in more high-level taxonomic assignments, blunting precision. To address this, we propose sample-tailored minimizer libraries, which improve on the minimizer-lowest common ancestor classification algorithm from the widely used Kraken 2. In this method, an initial filtering step using a large library removes non-resemblance genomes, followed by a refined classification step using a dynamically built smaller minimizer library. This 2-step classification method shows significant performance improvements compared to the state of the art. We develop a new computational tool called Slacken, a distributed and highly scalable platform based on Apache Spark, to implement the 2-step classification method, which improves speed while keeping the cost per sample comparable to Kraken 2. Specifically, in the CAMI2 'strain madness' samples, the fraction of reads classified at species level increased by 3.5×, while for in silico samples, it increased by 2.2×. The 2-step method achieves the sensitivity of large genomic libraries and the specificity of smaller ones, unlocking the true potential of large reference libraries for metagenomic read profiling.
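The two-step idea described in the abstract can be illustrated with a toy sketch. This is not Slacken's or Kraken 2's implementation: the taxonomy, genomes, reads, and every function name below are hypothetical, the minimizers are plain lexicographic minima without hashing, canonical k-mers, or spaced seeds, and the detection rule used to tailor the library (a genome is kept if at least `min_reads` reads were assigned directly to its taxon in step 1) is an assumption, since the abstract does not state the criterion.

```python
from collections import Counter


def minimizers(seq, k=3, w=2):
    """Set of minimizers: the smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def path_to_root(taxon, parent):
    """Taxon followed by all of its ancestors (the root has no parent entry)."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two taxa."""
    ancestors = set(path_to_root(a, parent))
    for t in path_to_root(b, parent):
        if t in ancestors:
            return t
    raise ValueError("taxa do not share a root")


def build_library(genomes, parent, k=3, w=2):
    """Map each minimizer to the LCA of all genomes (taxa) that contain it."""
    library = {}
    for taxon, seq in genomes.items():
        for m in minimizers(seq, k, w):
            library[m] = lca(library[m], taxon, parent) if m in library else taxon
    return library


def classify(read, library, parent, k=3, w=2):
    """Assign a read to the taxon with the best root-to-leaf hit score, or None."""
    hits = Counter(library[m] for m in minimizers(read, k, w) if m in library)
    if not hits:
        return None  # unclassified
    score = {t: sum(hits[a] for a in path_to_root(t, parent)) for t in hits}
    best = max(score.values())
    winners = [t for t in score if score[t] == best]
    result = winners[0]
    for t in winners[1:]:  # ties are resolved by moving up to the LCA
        result = lca(result, t, parent)
    return result


def two_step_classify(reads, genomes, parent, min_reads=1):
    """Step 1: classify against the full library. Step 2: reclassify against a
    library rebuilt from only the genomes detected in step 1 (assumed rule)."""
    full = build_library(genomes, parent)
    first = [classify(r, full, parent) for r in reads]
    counts = Counter(first)
    detected = {t: s for t, s in genomes.items() if counts[t] >= min_reads}
    tailored = build_library(detected, parent)
    second = [classify(r, tailored, parent) for r in reads]
    return first, second


# Hypothetical toy data: root 1, genus 2, species 3 and 4 sharing one region.
PARENT = {2: 1, 3: 2, 4: 2}
SHARED = "ACACCACCCA"
GENOMES = {3: SHARED + "GGAGGTGAGG", 4: SHARED + "TTATTCTATT"}
# The sample contains only species 3: one read from its unique region and
# one read from the region it shares with species 4.
READS = ["GGAGGTGAGG", SHARED]

first, second = two_step_classify(READS, GENOMES, PARENT)
species = set(GENOMES)
for label, calls in (("full library", first), ("tailored library", second)):
    fraction = sum(c in species for c in calls) / len(calls)
    print(f"{label}: assignments={calls}, species-level fraction={fraction}")
```

In this toy sample, the read from the shared region can only be placed at genus level against the full library, because its minimizers occur in both species. Once the library is rebuilt from the single detected genome, the same read resolves to species level, which mirrors the effect the abstract reports as an increase in the fraction of reads classified at species level.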


Source journal

NAR Genomics and Bioinformatics Multiple-

CiteScore

8.00

Self-citation rate

2.20%

Articles published

Review time

15 weeks