Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data

Impact factor: 11.8 | Zone 2, Biology | JCR Q1, Multidisciplinary Sciences

GigaScience | Pub Date: 2024-04-04 | DOI: 10.1093/gigascience/giae010

Michael B Hall, Lachlan J M Coin



Abstract

Background: Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequences at the point of sequencing, typically involving the use of resource-constrained devices. Existing benchmarks have largely focused on the use of standardized databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity.

Results: We benchmarked host removal pipelines on simulated and artificial real Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained improved specificity and sensitivity, compared to the standard databases for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was superior to most standard approaches, allowing them to be executed on a laptop device.

Conclusions: Customized pangenome databases provide the best balance of accuracy and computational efficiency when compared to standard databases for the task of human read removal and M. tuberculosis read classification from metagenomic samples. Such databases allow for execution on a laptop, without sacrificing accuracy, an especially important consideration in low-resource settings. We make all customized databases and pipelines freely available.
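As an illustration of the host-removal step described in the abstract, the sketch below shows one way reads flagged as human by a k-mer classifier could be dropped from a FASTQ file. It is not the authors' pipeline. It assumes Kraken2-style per-read output (tab-separated, with `C`/`U` status, read ID, and taxonomy ID in the first three columns, produced without `--use-names`), a single-end FASTQ with four lines per record, and that host reads are assigned directly to NCBI taxid 9606. The function names `host_read_ids` and `deplete_fastq` and the toy data are hypothetical.

```python
import io

HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def host_read_ids(kraken_lines, host_taxids=frozenset({HUMAN_TAXID})):
    """Collect IDs of reads that were classified to one of the host taxids.

    Assumes Kraken2-style per-read output: status, read ID, taxid, ...
    Descendant taxa of the host are NOT resolved in this sketch.
    """
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid in host_taxids:
            ids.add(read_id)
    return ids


def deplete_fastq(fastq_in, fastq_out, drop_ids):
    """Copy 4-line FASTQ records, skipping reads whose ID is in drop_ids.

    Returns a (kept, removed) tuple of record counts.
    """
    kept = removed = 0
    while True:
        record = [fastq_in.readline() for _ in range(4)]
        if not record[0].strip():
            break  # end of file (or trailing blank line)
        read_id = record[0][1:].split()[0]  # drop '@' and any description
        if read_id in drop_ids:
            removed += 1
        else:
            fastq_out.writelines(record)
            kept += 1
    return kept, removed


# Toy inputs: read1 is human, read2 is unclassified, read3 is M. tuberculosis (taxid 1773).
kraken_output = (
    "C\tread1\t9606\t8\t9606:4\n"
    "U\tread2\t0\t8\t0:4\n"
    "C\tread3\t1773\t8\t1773:4\n"
)
fastq = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2 some description\nTTTTACGT\n+\nIIIIIIII\n"
    "@read3\nGGGGCCCC\n+\nIIIIIIII\n"
)

human_ids = host_read_ids(io.StringIO(kraken_output))
depleted = io.StringIO()
kept, removed = deplete_fastq(io.StringIO(fastq), depleted, human_ids)
print(f"kept={kept} removed={removed}")
```

With a database built only from human genomes, every classified read could arguably be treated as host; the taxid check is kept here so the same sketch also works with a mixed database.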


Source journal: GigaScience (Multidisciplinary Sciences)

CiteScore: 15.50
Self-citation rate: 1.10%
Articles published: 119
Review time: 1 week

About the journal: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.