nf-core/detaxizer: a benchmarking study for decontamination from human sequences.

Impact factor 2.8; JCR Q1 (Genetics & Heredity)

NAR Genomics and Bioinformatics. Pub Date: 2025-09-23. eCollection Date: 2025-09-01. DOI: 10.1093/nargab/lqaf125

Jannik Seidel, Camill Kaipf, Daniel Straub, Sven Nahnsen

Volume 7, issue 3, article lqaf125. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12455401/pdf/


Abstract

Privacy is paramount in health data, particularly in human genetics, where information extends beyond individuals to their relatives. Metagenomic datasets contain substantial human genetic material, necessitating careful handling to mitigate data leakage risks when sharing or publishing. The same applies, although to a lesser extent, to genetic datasets from the environment and to datasets from contaminated laboratory samples. Completely removing human sequence data while retaining unbiased nonhuman reads is currently not achievable, although several tools exist that attempt it. To address these issues, we developed nf-core/detaxizer, a Nextflow-based pipeline that employs Kraken2 and bbmap/bbduk for taxonomic classification, identifying and optionally filtering Homo sapiens reads. Due to its generalized design, other taxa can also be classified and filtered. We benchmark its filtering efficacy for human reads against Hostile and CLEAN, demonstrating its utility for secure data preprocessing. The comparison showed that the choice of tool and database can result in differences of up to an order of magnitude in both the amount of human data not removed and the amount of microbial data mistakenly removed. As part of the nf-core initiative, nf-core/detaxizer adheres to best practices, leveraging containerized dependencies for streamlined installation. The source code is openly available under the MIT license: https://github.com/nf-core/detaxizer.
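
The abstract compares tools by two quantities: the amount of human data not removed and the amount of microbial data mistakenly removed. It does not give formulas, so the following is only a minimal sketch of how such counts could be computed for a simulated dataset with known read origins. The function name `evaluate_filter`, the `DecontaminationMetrics` record, and the read IDs are hypothetical and are not taken from the paper or from nf-core/detaxizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecontaminationMetrics:
    human_retained: int           # human reads still present after filtering
    nonhuman_removed: int         # non-human reads removed by mistake
    human_retained_frac: float    # human_retained / all human reads
    nonhuman_removed_frac: float  # nonhuman_removed / all non-human reads


def evaluate_filter(truth_is_human: dict[str, bool],
                    kept_ids: set[str]) -> DecontaminationMetrics:
    """Compare a filter's output against ground-truth read labels.

    truth_is_human maps each read ID to True (human) or False (non-human).
    kept_ids holds the IDs of reads that survived filtering.
    Both inputs are assumptions of this sketch, e.g. from simulated data.
    """
    human_total = sum(1 for is_human in truth_is_human.values() if is_human)
    nonhuman_total = len(truth_is_human) - human_total

    # Human reads the filter failed to remove (privacy-relevant leakage).
    human_retained = sum(
        1 for rid, is_human in truth_is_human.items()
        if is_human and rid in kept_ids
    )
    # Non-human reads the filter discarded (bias in downstream analysis).
    nonhuman_removed = sum(
        1 for rid, is_human in truth_is_human.items()
        if not is_human and rid not in kept_ids
    )

    return DecontaminationMetrics(
        human_retained=human_retained,
        nonhuman_removed=nonhuman_removed,
        human_retained_frac=human_retained / human_total if human_total else 0.0,
        nonhuman_removed_frac=(
            nonhuman_removed / nonhuman_total if nonhuman_total else 0.0
        ),
    )


# Toy example: two human reads and four non-human reads.
truth = {"r1": True, "r2": True, "r3": False,
         "r4": False, "r5": False, "r6": False}
kept = {"r2", "r3", "r4", "r5"}  # r1 correctly removed, r6 wrongly removed
metrics = evaluate_filter(truth, kept)
print(metrics)
```

Reporting the two error types separately, rather than as a single accuracy figure, mirrors the abstract's framing: retained human reads are a privacy risk, while removed microbial reads distort the remaining data.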
