nf-core/detaxizer: a benchmarking study for decontamination from human sequences.

Impact factor: 2.8; JCR Q1 (Genetics & Heredity)
NAR Genomics and Bioinformatics. Pub Date: 2025-09-23. eCollection Date: 2025-09-01. DOI: 10.1093/nargab/lqaf125
Jannik Seidel, Camill Kaipf, Daniel Straub, Sven Nahnsen
Volume 7, issue 3, article lqaf125. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12455401/pdf/

Abstract

Privacy is paramount in health data, particularly in human genetics, where information extends beyond individuals to their relatives. Metagenomic datasets contain substantial human genetic material, necessitating careful handling to mitigate data leakage risks when sharing or publishing. The same applies, to a lesser extent, to genetic datasets from the environment and to datasets from contaminated laboratory samples. Completely removing human sequence data while retaining unbiased nonhuman reads is not currently achievable, but several tools exist for the task. To address this, we developed nf-core/detaxizer, a Nextflow-based pipeline that employs Kraken2 and bbmap/bbduk for taxonomic classification, identifying and optionally filtering Homo sapiens reads. Owing to its generalized design, other taxa can also be classified and filtered. We benchmark its filtering efficacy for human reads against Hostile and CLEAN, demonstrating its utility for secure data preprocessing. The comparison showed that the choice of tool and database can change both the amount of human data not removed and the amount of microbial data mistakenly removed by up to an order of magnitude. As part of the nf-core initiative, nf-core/detaxizer adheres to best practices, leveraging containerized dependencies for streamlined installation. The source code is openly available under the MIT license: https://github.com/nf-core/detaxizer.
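
The two quantities compared in the benchmark (human data not removed and microbial data mistakenly removed) can be illustrated with a minimal sketch. It assumes a dataset in which the true origin of every read is known and a set of read IDs that survived filtering. The names `DecontaminationResult`, `evaluate`, `truth`, and `kept_ids` are hypothetical and are not taken from the paper or from the pipeline; the abstract does not describe how the authors computed their metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecontaminationResult:
    """Counts for one tool/database combination (hypothetical container)."""
    human_retained: int     # human reads the tool failed to remove
    human_total: int        # all human reads in the input
    microbial_removed: int  # microbial reads removed by mistake
    microbial_total: int    # all microbial reads in the input

    @property
    def human_retained_fraction(self) -> float:
        # Share of human reads still present after filtering.
        return self.human_retained / self.human_total if self.human_total else 0.0

    @property
    def microbial_removed_fraction(self) -> float:
        # Share of microbial reads lost to over-filtering.
        return (
            self.microbial_removed / self.microbial_total
            if self.microbial_total
            else 0.0
        )


def evaluate(truth: dict[str, str], kept_ids: set[str]) -> DecontaminationResult:
    """Compare filtered output against known read origins.

    truth maps read ID -> "human" or "microbial" (assumed labels);
    kept_ids holds the IDs of reads that survived filtering.
    """
    human = {rid for rid, label in truth.items() if label == "human"}
    microbial = {rid for rid, label in truth.items() if label == "microbial"}
    return DecontaminationResult(
        human_retained=len(human & kept_ids),
        human_total=len(human),
        microbial_removed=len(microbial - kept_ids),
        microbial_total=len(microbial),
    )


# Toy input: two human reads and three microbial reads.
truth = {
    "r1": "human",
    "r2": "human",
    "r3": "microbial",
    "r4": "microbial",
    "r5": "microbial",
}
kept = {"r2", "r3", "r4"}  # r1 correctly removed, r5 wrongly removed

result = evaluate(truth, kept)
print(f"human retained: {result.human_retained_fraction:.2f}")
print(f"microbial removed: {result.microbial_removed_fraction:.2f}")
```

In this toy input, one of two human reads remains and one of three microbial reads is lost. Running the same evaluation for each tool and database would give the kind of side-by-side comparison the abstract describes.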
