A realistic benchmark for differential abundance testing and confounder adjustment in human microbiome studies

IF 10.1 Zone 1 (Biology) Q1 BIOTECHNOLOGY & APPLIED MICROBIOLOGY

Genome Biology, Pub Date: 2024-09-25, DOI: 10.1186/s13059-024-03390-9

Jakob Wirbel, Morgan Essex, Sofia Kirke Forslund, Georg Zeller

Abstract

In microbiome disease association studies, a fundamental task is to test which microbes differ in abundance between groups. Yet consensus on suitable or optimal statistical methods for differential abundance testing is lacking, and how these methods cope with confounding remains unexplored. Previous differential abundance benchmarks relying on simulated datasets did not quantitatively evaluate the similarity to real data, which undermines their recommendations. Our simulation framework implants calibrated signals into real taxonomic profiles, including signals mimicking confounders. Using several whole metagenome and 16S rRNA gene amplicon datasets, we validate that our simulated data resemble real data from disease association studies much more closely than the data used in previous benchmarks. With extensively parametrized simulations, we benchmark the performance of nineteen differential abundance methods and further evaluate the best ones on confounded simulations. Only classic statistical methods (linear models, the Wilcoxon test, t-test), limma, and fastANCOM properly control false discoveries at relatively high sensitivity. When confounders are additionally considered, the problems with false discovery control are exacerbated, but we find that adjusted differential abundance testing can effectively mitigate them. In a large cardiometabolic disease dataset, we showcase that failure to account for covariates such as medication causes spurious associations in real-world applications. Tight error control is critical for microbiome association studies. The unsatisfactory performance of many differential abundance methods and the persistent danger of unchecked confounding suggest that these contribute to a lack of reproducibility among such studies. We have open-sourced our simulation and benchmarking software to foster a much-needed consolidation of statistical methodology for microbiome research.
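To make the workflow in the abstract concrete, the following is a minimal illustrative sketch, not the authors' released software. It assumes that signal implantation can be approximated by multiplying the abundance of chosen features in one group by a fixed factor (the abstract does not specify the calibration procedure), tests each feature with a Wilcoxon rank-sum test using a normal approximation with tie correction and no continuity correction, and adjusts p-values with the Benjamini-Hochberg procedure. All function names and parameters (`implant_signal`, `fold`, `differential_abundance`, and so on) are hypothetical.

```python
import math
import random


def implant_signal(profiles, labels, features, fold=2.0):
    """Copy `profiles` and scale the chosen feature columns by `fold` in group 1.

    Assumption: a purely multiplicative abundance shift; the calibration used
    in the actual framework is not reproduced here.
    """
    out = [row[:] for row in profiles]
    for row, label in zip(out, labels):
        if label == 1:
            for j in features:
                row[j] *= fold
    return out


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_rank_sum_p(x, y):
    """Two-sided p-value via the normal approximation with tie correction."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:  # all values identical: no evidence of a difference
        return 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differential_abundance(profiles, labels):
    """Per-feature Wilcoxon tests between groups 1 and 0, BH-adjusted."""
    pvals = []
    for j in range(len(profiles[0])):
        x = [row[j] for row, lab in zip(profiles, labels) if lab == 1]
        y = [row[j] for row, lab in zip(profiles, labels) if lab == 0]
        pvals.append(wilcoxon_rank_sum_p(x, y))
    return benjamini_hochberg(pvals)


if __name__ == "__main__":
    # Toy stand-in for real taxonomic profiles: 60 samples x 20 features.
    rng = random.Random(0)
    labels = [0] * 30 + [1] * 30
    profiles = [[rng.lognormvariate(0, 1) for _ in range(20)] for _ in labels]
    spiked = implant_signal(profiles, labels, features=[0, 1], fold=4.0)
    for j, q in enumerate(differential_abundance(spiked, labels)):
        print(f"feature {j}: adjusted p = {q:.4f}")
```

Because the implanted features are known, the adjusted p-values can be compared against ground truth to estimate false discovery rate and sensitivity, which is the kind of evaluation the abstract describes. Confounder adjustment is not covered by this sketch; it would need a model that includes the covariate, such as a linear model.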

Source journal: Genome Biology (Biochemistry, Genetics and Molecular Biology: Genetics)

CiteScore: 21.00

Self-citation rate: 3.30%

Articles published: 241

Review time: 2 months

About the journal: Genome Biology stands as a premier platform for exceptional research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens. With an impact factor of 12.3 (2022), the journal is the 3rd-ranked research journal in the Genetics and Heredity category and the 2nd-ranked research journal in the Biotechnology and Applied Microbiology category by Thomson Reuters. Notably, Genome Biology is the highest-ranked open-access journal in this category. Our dedicated team of highly trained in-house Editors collaborates closely with our Editorial Board of international experts, ensuring the journal remains at the forefront of scientific advances and community standards. Regular engagement with researchers at conferences and institute visits underscores our commitment to staying abreast of the latest developments in the field.