MIDASim: a fast and simple simulator for realistic microbiome data

IF 13.8 | Zone 1, Biology | JCR Q1, Microbiology
Mengyu He, Ni Zhao, Glen A. Satten
Journal: Microbiome
Publication date: 2024-07-22
Publication type: Journal Article
DOI: 10.1186/s40168-024-01822-z (https://doi.org/10.1186/s40168-024-01822-z)
Citations: 0

Abstract

Advances in sequencing technology have led to the discovery of associations between the human microbiota and many diseases, conditions, and traits. With the increasing availability of microbiome data, many statistical methods have been developed for studying these associations. The growing number of newly developed methods highlights the need for simple, rapid, and reliable ways to simulate realistic microbiome data, which are essential for validating these methods and evaluating their performance. However, generating realistic microbiome data is challenging because such data are complex: they feature correlation between taxa, sparsity, overdispersion, and compositionality. Current simulation methods either fail to capture these important features or require exorbitant computational time.

We develop MIDASim (MIcrobiome DAta Simulator), a fast and simple approach for simulating realistic microbiome data that reproduces the distributional and correlation structure of a template microbiome dataset. MIDASim is a two-step approach. The first step generates correlated binary indicators that represent the presence-absence status of all taxa. The second step generates relative abundances and counts for the taxa considered to be present in step 1, using a Gaussian copula to account for the taxon-taxon correlations. In the second step, MIDASim can operate in either a nonparametric or a parametric mode. In the nonparametric mode, the Gaussian copula uses the empirical distribution of relative abundances for the marginal distributions. In the parametric mode, a generalized gamma distribution is used in place of the empirical distribution.

We demonstrate improved performance of MIDASim relative to existing methods using gut and vaginal data. In both parametric and nonparametric modes, MIDASim showed superior performance as measured by PERMANOVA, alpha diversity, and beta dispersion. We also show how MIDASim in parametric mode can be used to assess the performance of methods for finding differentially abundant taxa in a compositional model.

MIDASim is easy to implement, flexible, and suitable for most microbiome data simulation situations. MIDASim has three major advantages. First, MIDASim reproduces the distributional features of real data better than other methods, at both the presence-absence level and the relative-abundance level. MIDASim-simulated data are more similar to the template data than data simulated by competing methods, as quantified using a variety of measures. Second, MIDASim makes few distributional assumptions for the relative abundances, and thus can easily accommodate complex distributional features in real data. Third, MIDASim is computationally efficient and can be used to simulate large microbiome datasets.
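
The two-step idea described in the abstract can be illustrated with a small Python sketch. This is not the MIDASim implementation (MIDASim itself is not shown on this page, and its estimation details are not reproduced here); it is a simplified toy version of the nonparametric mode under stated assumptions. The names `simulate`, `template`, and `corr` are hypothetical, a single user-supplied correlation matrix is reused for both steps, and count generation is omitted.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def cholesky(matrix):
    """Return the lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def correlated_uniforms(chol, rng):
    """Draw correlated standard normals and map them to [0, 1] via the normal CDF."""
    z = [rng.gauss(0.0, 1.0) for _ in chol]
    correlated = [sum(chol[i][k] * z[k] for k in range(i + 1)) for i in range(len(chol))]
    return [STD_NORMAL.cdf(value) for value in correlated]


def empirical_quantile(sorted_values, u):
    """Inverse of the empirical CDF: the observed value at quantile level u."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def simulate(template, corr, n_samples, seed=0):
    """Toy two-step simulator (an assumption-laden sketch, not MIDASim's algorithm)."""
    rng = random.Random(seed)
    chol = cholesky(corr)
    n, p = len(template), len(template[0])
    # Marginal summaries of the template: prevalence and sorted nonzero abundances.
    prevalence = [sum(row[j] > 0 for row in template) / n for j in range(p)]
    nonzero = [sorted(row[j] for row in template if row[j] > 0) for j in range(p)]

    simulated = []
    for _ in range(n_samples):
        # Step 1: correlated presence-absence indicators matching each prevalence.
        u1 = correlated_uniforms(chol, rng)
        present = [u1[j] < prevalence[j] for j in range(p)]
        # Step 2: Gaussian copula with empirical marginals for the present taxa.
        u2 = correlated_uniforms(chol, rng)
        raw = [empirical_quantile(nonzero[j], u2[j]) if present[j] else 0.0
               for j in range(p)]
        # Renormalise so that each non-empty sample is a composition summing to 1.
        total = sum(raw)
        simulated.append([v / total for v in raw] if total > 0 else raw)
    return simulated


# Hypothetical template: 6 samples (rows) by 3 taxa (columns), relative abundances.
template = [
    [0.6, 0.4, 0.0],
    [0.5, 0.0, 0.5],
    [0.7, 0.3, 0.0],
    [0.2, 0.5, 0.3],
    [0.9, 0.1, 0.0],
    [0.4, 0.0, 0.6],
]
# Hypothetical taxon-taxon correlation matrix (must be positive definite).
corr = [
    [1.0, 0.3, 0.1],
    [0.3, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
simulated = simulate(template, corr, n_samples=200, seed=1)
```

In this sketch, sparsity comes from step 1 and the shape of the nonzero abundances from step 2, so the two features are matched to the template separately. Swapping `empirical_quantile` for the quantile function of a fitted parametric distribution would correspond loosely to the parametric mode described above.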
Source journal: Microbiome (Microbiology)
CiteScore: 21.90
Self-citation rate: 2.60%
Articles published: 198
Review time: 4 weeks
Journal description: Microbiome is a journal that focuses on studies of microbiomes in humans, animals, plants, and the environment. It covers both natural and manipulated microbiomes, such as those in agriculture. The journal is interested in research that uses meta-omics approaches or novel bioinformatics tools and emphasizes the community/host interaction and structure-function relationship within the microbiome. Studies that go beyond descriptive omics surveys and include experimental or theoretical approaches will be considered for publication. The journal also encourages research that establishes cause-and-effect relationships and supports proposed microbiome functions. However, studies of individual microbial isolates or species that do not explore their impact on the host or on the complex microbiome structures and functions will not be considered for publication. Microbiome is indexed in BIOSIS, Current Contents, DOAJ, Embase, MEDLINE, PubMed, PubMed Central, and Science Citation Index Expanded.