An integrated method for clustering and association network inference

IF 1.6 · Zone 3 (Mathematics) · Q3 (Computer Science, Interdisciplinary Applications)

Computational Statistics & Data Analysis Pub Date : 2026-07-01 Epub Date: 2026-01-28 DOI:10.1016/j.csda.2026.108347

Jeanne Tous, Julien Chiquet

Computational Statistics & Data Analysis, volume 219, Article 108347. https://www.sciencedirect.com/science/article/pii/S0167947326000095

Citations: 0

Abstract

High-dimensional Gaussian graphical models provide a rigorous framework to describe a network of statistical dependencies between variables, such as genes in genomic regulation studies or species in ecology. Penalized methods, including the standard Graphical-Lasso, are well-known approaches to infer the parameters of these models. As the number of variables in the model grows, network inference and interpretation become more complex. This paper discusses the Normal-Block model, which clusters variables and considers a network at the cluster level. This both adds structure to the network and reduces the number of parameters at stake, thereby easing the inference and interpretation of the underlying network. The approach builds on Graphical-Lasso to add a penalty on the network's edges and limit the detection of spurious dependencies. A zero-inflated version of the model is also proposed to account for real-world data properties. For the inference procedure, two approaches are introduced: a two-step method based on existing approaches, and an original, more rigorous method that simultaneously infers the clustering of variables and the association network between clusters, using a penalized variational Expectation-Maximization approach. An implementation of the model in R, in a package called normalblockr, is available on GitHub. The results of the models in terms of clustering and network inference are presented, using both simulated data and various types of real-world data (proteomics and word occurrences on webpages).
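To make the notion of a "network of statistical dependencies" concrete: in a Gaussian graphical model, an edge between two variables corresponds to a nonzero off-diagonal entry of the precision matrix (the inverse covariance), which is why penalized methods such as Graphical-Lasso target sparsity in that matrix. The sketch below is only an illustration of this general fact. The 3-variable chain, the helper names `invert` and `edges`, and the use of exact arithmetic are assumptions made for the example; it does not reproduce the Normal-Block model or the normalblockr package.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix exactly using Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot equals 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def edges(matrix):
    """List index pairs (i, j), i < j, whose off-diagonal entry is nonzero."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if matrix[i][j] != 0]


# Hypothetical toy precision matrix for a chain 0 - 1 - 2:
# variables 0 and 2 are conditionally independent given variable 1.
precision = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]

covariance = invert(precision)   # dense: every pair is marginally correlated
recovered = invert(covariance)   # sparse again: only the chain edges remain

print("nonzero covariance pairs:", edges(covariance))
print("nonzero precision pairs: ", edges(recovered))
```

In this toy case the covariance has no zero entries even though variables 0 and 2 are not directly linked, whereas the precision matrix has a zero exactly where the edge is missing. With estimated rather than exact matrices, such zeros do not appear on their own, which is the role of the edge penalty mentioned in the abstract.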


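Source journal

Computational Statistics & Data Analysis (Mathematics; Computer Science, Interdisciplinary Applications)

CiteScore: 3.70

Self-citation rate: 5.60%

Articles published: 167

Review time: 60 days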

Journal description: Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas: I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article. II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model-free data exploration, pattern recognition, psychometrics, statistical physics, image processing, robust procedures. [...] III) Special Applications - [...] IV) Annals of Statistical Data Science [...]