Joshua Tobin , Michaela Black , James Ng , Debbie Rankin , Jonathan Wallace , Catherine Hughes , Leane Hoey , Adrian Moore , Jinling Wang , Geraldine Horigan , Paul Carlin , Helene McNulty , Anne M. Molloy , Mimi Zhang
{"title":"Co-clustering multi-view data using the Latent Block Model","authors":"Joshua Tobin , Michaela Black , James Ng , Debbie Rankin , Jonathan Wallace , Catherine Hughes , Leane Hoey , Adrian Moore , Jinling Wang , Geraldine Horigan , Paul Carlin , Helene McNulty , Anne M. Molloy , Mimi Zhang","doi":"10.1016/j.csda.2025.108188","DOIUrl":null,"url":null,"abstract":"<div><div>The Latent Block Model (LBM) is a prominent model-based co-clustering method, returning parametric representations of each block-cluster and allowing the use of well-grounded model selection methods. Although the LBM has been adapted to accommodate various feature types, it cannot be applied to datasets consisting of multiple distinct sets of features, termed views, for a common set of observations. The multi-view LBM is introduced herein, extending the LBM method to multi-view data, where each view marginally follows an LBM. For any pair of two views, the dependence between them is captured by a row-cluster membership matrix. A likelihood-based approach is formulated for parameter estimation, harnessing a stochastic EM algorithm merged with a Gibbs sampler, while an ICL criterion is formulated to determine the number of row- and column-clusters in each view. To justify the application of the multi-view approach, hypothesis tests are formulated to evaluate the independence of row-clusters across views, with the testing procedure seamlessly integrated into the estimation framework. A penalty scheme is also introduced to induce sparsity in row-clusterings. The algorithm's performance is validated using synthetic and real-world datasets, accompanied by recommendations for optimal parameter selection. Finally, the multi-view co-clustering method is applied to a complex genomics dataset, and is shown to provide new insights for high-dimension multi-view problems.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108188"},"PeriodicalIF":1.5000,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Statistics & Data Analysis","FirstCategoryId":"100","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167947325000647","RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0
Abstract
The Latent Block Model (LBM) is a prominent model-based co-clustering method, returning parametric representations of each block-cluster and allowing the use of well-grounded model selection methods. Although the LBM has been adapted to accommodate various feature types, it cannot be applied to datasets consisting of multiple distinct sets of features, termed views, for a common set of observations. The multi-view LBM is introduced herein, extending the LBM method to multi-view data, where each view marginally follows an LBM. For any pair of two views, the dependence between them is captured by a row-cluster membership matrix. A likelihood-based approach is formulated for parameter estimation, harnessing a stochastic EM algorithm merged with a Gibbs sampler, while an ICL criterion is formulated to determine the number of row- and column-clusters in each view. To justify the application of the multi-view approach, hypothesis tests are formulated to evaluate the independence of row-clusters across views, with the testing procedure seamlessly integrated into the estimation framework. A penalty scheme is also introduced to induce sparsity in row-clusterings. The algorithm's performance is validated using synthetic and real-world datasets, accompanied by recommendations for optimal parameter selection. Finally, the multi-view co-clustering method is applied to a complex genomics dataset, and is shown to provide new insights for high-dimension multi-view problems.
期刊介绍:
Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas:
I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics,computer graphics, computer intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article.
II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model free data exploration, pattern recognition, psychometrics, statistical physics, image processing, robust procedures.
[...]
III) Special Applications - [...]
IV) Annals of Statistical Data Science [...]