A flexible mixed-membership model for community and enterotype detection for microbiome data

IF 1.6; Zone 3 (Mathematics); Q3 (Computer Science, Interdisciplinary Applications)

Computational Statistics & Data Analysis; Pub Date: 2025-04-04; DOI: 10.1016/j.csda.2025.108181

Alice Giampino, Roberto Ascari, Sonia Migliorati

Volume 210, Article 108181. Full text: https://www.sciencedirect.com/science/article/pii/S016794732500057X

Citations: 0

Abstract

Understanding how the human gut microbiome affects host health is challenging due to the wide interindividual variability, sparsity, and high dimensionality of microbiome data. Mixed-membership models have been previously applied to these data to detect latent communities of bacterial taxa that are expected to co-occur. The most widely used mixed-membership model is latent Dirichlet allocation (LDA). However, LDA is limited by the rigidity of the Dirichlet distribution imposed on the community proportions, which hinders its ability to model dependencies and account for overdispersion. To address this limitation, a generalization of LDA is proposed that introduces greater flexibility into the covariance matrix by incorporating the flexible Dirichlet (FD), a specific identifiable mixture with Dirichlet components. In addition to identifying communities, the new model enables the detection of enterotypes, i.e., clusters of samples with similar microbe composition. For inferential purposes, a computationally efficient collapsed Gibbs sampler that exploits the conjugacy of the FD distribution with respect to the multinomial model is proposed. A simulation study demonstrates the model's ability to accurately recover true parameter values by minimizing appropriate compositional discrepancy measures between the true and estimated values. Additionally, the model correctly identifies the number of communities, as evidenced by perplexity scores. Moreover, an application to the COMBO dataset highlights its effectiveness in detecting biologically significant and coherent communities and enterotypes, revealing a broader range of correlations between community abundances. These results underscore the new model as a definite improvement over LDA.
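
The abstract reports that the number of communities is selected by perplexity scores. As a rough illustration of that criterion, the sketch below computes the standard perplexity of a generic mixed-membership model on a sample-by-taxon count table. It is not the authors' implementation: the function name `perplexity` and the inputs `counts`, `theta` (per-sample community proportions), and `beta` (per-community taxon distributions) are assumptions made for this example, and the paper may evaluate perplexity differently, for instance on held-out samples.

```python
import math


def perplexity(counts, theta, beta):
    """Perplexity of a mixed-membership model on a count table (lower is better).

    counts: list of rows, one per sample, holding taxon counts.
    theta:  list of rows, one per sample, holding community proportions (each sums to 1).
    beta:   list of rows, one per community, holding taxon probabilities (each sums to 1).
    All three names are hypothetical and chosen only for this sketch.
    """
    log_lik = 0.0
    total = 0
    n_communities = len(beta)
    for d, row in enumerate(counts):
        for v, n in enumerate(row):
            if n == 0:
                continue  # zero counts contribute nothing to the log-likelihood
            # Marginal probability of taxon v in sample d, mixing over communities.
            p = sum(theta[d][k] * beta[k][v] for k in range(n_communities))
            log_lik += n * math.log(p)
            total += n
    # Exponentiated negative average log-likelihood per observed read.
    return math.exp(-log_lik / total)


if __name__ == "__main__":
    counts = [[3, 1], [0, 4]]
    theta = [[0.5, 0.5], [0.5, 0.5]]
    beta = [[0.5, 0.5], [0.5, 0.5]]
    print(perplexity(counts, theta, beta))
```

In practice one would fit the model for several candidate numbers of communities and compare the resulting perplexities, preferring the value at which the score stops improving.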

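Source journal

Computational Statistics & Data Analysis (Mathematics; Computer Science, Interdisciplinary Applications)

CiteScore: 3.70
Self-citation rate: 5.60%
Articles published: 167
Review time: 60 days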

About the journal: Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas: I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article. II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model free data exploration, pattern recognition, psychometrics, statistical physics, image processing, robust procedures. [...] III) Special Applications - [...] IV) Annals of Statistical Data Science [...]