Multivariate bounded support Kotz mixture model with semi-supervised projected model-based clustering

IF 14.7; Zone 1 (Computer Science); Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Information Fusion Pub Date : 2025-06-13 DOI:10.1016/j.inffus.2025.103330

Tsega Weldu Araya , Muhammad Azam , Nizar Bouguila , Jamal Bentahar

Information Fusion, Volume 124, Article 103330. Full text: https://www.sciencedirect.com/science/article/pii/S1566253525004038

Citations: 0

Abstract


Multivariate bounded support Kotz mixture model with semi-supervised projected model-based clustering

Data clustering is a crucial technique in data analysis, aimed at identifying and grouping similar data points to uncover underlying structures within a dataset. We propose a new unsupervised clustering approach using a multivariate bounded Kotz mixture model (BKMM) for data modeling when the data lie within a bounded support region. In many real applications, BKMM effectively handles observed data that fall within these limits, accurately modeling and clustering the observations. In BKMM, parameter estimation is performed by maximizing the log-likelihood using the Expectation–Maximization (EM) algorithm and the Newton–Raphson method. Additionally, we explore the enhancements in clustering performance through semi-supervised learning by incorporating a small amount of labeled data to guide the clustering process. Thus, we propose a bounded Kotz mixture model using a semi-supervised projected model-based clustering method (BKMM-SeSProC) to obtain hidden cluster labels. Model selection in mixtures is essential for determining the optimal number of mixture components, and we introduce a minimum message length (MML) model selection criterion to find the best number of clusters in the BKMM-SeSProC approach. A greedy forward search is applied to estimate the optimal number of clusters. We use the same datasets to evaluate our proposed models, BKMM and BKMM-SeSProC, for data clustering. Additionally, we utilize MML model selection with BKMM-SeSProC to determine the number of components. Initially, we validate both proposed models and the model selection process in various medical applications. Furthermore, to assess their broader performance, we test the models on image datasets, including Alzheimer’s disease, lung tissue, and gastrointestinal tract images for disease recognition, and the CIFAR-100 dataset for object categorization.
BKMM is compared with the Kotz mixture model (KMM), Student’s t mixture model (SMM), Laplace mixture model (LMM), bounded Gaussian mixture model (BGMM), and Gaussian mixture model (GMM) under similar experimental settings across all datasets. To evaluate the performance of BKMM and BKMM-SeSProC, several performance metrics are employed. To find the best number of clusters for BKMM-SeSProC, we examine the effectiveness of MML model selection against seven different criteria. The experimental results demonstrate that the proposed BKMM outperforms the compared models, KMM, SMM, LMM, BGMM, and GMM, in all applications. Additionally, the semi-supervised projected model-based clustering shows better performance across all evaluation metrics compared to unsupervised BKMM.
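The two ideas the abstract combines, a density restricted to a bounded support region and an EM step guided by a few labeled points, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the Kotz density or the SeSProC update rules, so the sketch assumes a univariate Gaussian as a stand-in base density, an interval `[lo, hi]` as the support, and a simple rule that clamps labeled points to their known component. All function and variable names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Unbounded Gaussian density (stand-in for the Kotz density, an assumption)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """Gaussian cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bounded_pdf(x, mu, sigma, lo, hi):
    """Base density restricted to [lo, hi] and renormalised to integrate to 1."""
    if x < lo or x > hi:
        return 0.0  # no probability mass outside the support region
    mass = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    return normal_pdf(x, mu, sigma) / mass


def e_step(xs, labels, weights, mus, sigmas, lo, hi):
    """Compute responsibilities for a bounded mixture.

    Labeled points (label is not None) are clamped to a one-hot vector;
    unlabeled points get the usual posterior over components.
    Assumes every unlabeled x lies inside [lo, hi].
    """
    n_components = len(weights)
    resp = []
    for x, y in zip(xs, labels):
        if y is not None:
            resp.append([1.0 if k == y else 0.0 for k in range(n_components)])
            continue
        joint = [w * bounded_pdf(x, m, s, lo, hi)
                 for w, m, s in zip(weights, mus, sigmas)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp


# Toy data on [0, 1]: the first and last points carry labels, the middle one does not.
xs = [0.1, 0.4, 0.9]
labels = [0, None, 1]
resp = e_step(xs, labels, weights=[0.5, 0.5], mus=[0.2, 0.8],
              sigmas=[0.2, 0.2], lo=0.0, hi=1.0)
```

Dividing by the mass inside the interval is what makes each component a proper density on the bounded region; in a full EM loop the M-step would then re-estimate the weights and component parameters from `resp`.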


Source journal

Information Fusion (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 33.20

Self-citation rate: 4.30%

Articles published: 161

Review time: 7.9 months

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.