Tsega Weldu Araya , Muhammad Azam , Nizar Bouguila , Jamal Bentahar
Information Fusion, volume 124, Article 103330. Published 2025-06-13. DOI: 10.1016/j.inffus.2025.103330
Impact factor: 14.7. JCR: Q1 (Computer Science, Artificial Intelligence).
https://www.sciencedirect.com/science/article/pii/S1566253525004038
Multivariate bounded support Kotz mixture model with semi-supervised projected model-based clustering
Data clustering is a core technique in data analysis that identifies and groups similar data points to uncover underlying structure in a dataset. We propose a new unsupervised clustering approach based on a multivariate bounded Kotz mixture model (BKMM) for data that lie within a bounded support region. In many real applications the observed data fall within such limits, and BKMM models and clusters these observations accurately. In BKMM, parameters are estimated by maximizing the log-likelihood with the Expectation–Maximization (EM) algorithm and the Newton–Raphson method. We also explore how semi-supervised learning improves clustering performance by incorporating a small amount of labeled data to guide the clustering process. To this end, we propose a bounded Kotz mixture model with a semi-supervised projected model-based clustering method (BKMM-SeSProC) to obtain hidden cluster labels. Because model selection is essential for determining the optimal number of mixture components, we introduce a minimum message length (MML) model selection criterion for BKMM-SeSProC and apply a greedy forward search to estimate the optimal number of clusters. We evaluate both proposed models, BKMM and BKMM-SeSProC, on the same datasets. We first validate the models and the model selection process in various medical applications. To assess their broader performance, we then test them on image datasets: Alzheimer’s disease, lung tissue, and gastrointestinal tract images for disease recognition, and the CIFAR-100 dataset for object categorization. BKMM is compared with the Kotz mixture model (KMM), Student’s t mixture model (SMM), Laplace mixture model (LMM), bounded Gaussian mixture model (BGMM), and Gaussian mixture model (GMM) under similar experimental settings across all datasets. Several performance metrics are used to evaluate BKMM and BKMM-SeSProC, and the effectiveness of MML model selection for BKMM-SeSProC is examined against seven different criteria. The experimental results demonstrate that the proposed BKMM outperforms KMM, SMM, LMM, BGMM, and GMM in all applications, and that semi-supervised projected model-based clustering performs better than unsupervised BKMM across all evaluation metrics.
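The abstract does not state the density itself. As a rough illustration of what "bounded support" means here, the sketch below follows the construction commonly used for bounded mixture components: take an unbounded kernel, set it to zero outside the support region, and renormalize so it integrates to one inside. Everything specific in it is an assumption made for illustration and not the paper's formulation: the one-dimensional Kotz-type kernel `u**(N-1) * exp(-r * u**s)`, the function names, the parameter values, and the trapezoid-rule normalization.

```python
import math


def kotz_kernel(x, mu, sigma, N=1.0, r=0.5, s=1.0):
    """Unnormalized 1-D Kotz-type kernel (assumed form, N >= 1).

    With N=1, r=0.5, s=1 it reduces to the Gaussian shape exp(-u/2).
    """
    u = ((x - mu) / sigma) ** 2
    return u ** (N - 1.0) * math.exp(-r * u ** s)


def trapezoid(f, lo, hi, grid=2000):
    """Trapezoid-rule integral of f over [lo, hi] on a uniform grid."""
    step = (hi - lo) / grid
    # Append hi explicitly so the last node never overshoots the bound.
    xs = [lo + (hi - lo) * i / grid for i in range(grid)] + [hi]
    ys = [f(x) for x in xs]
    return step * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


def make_bounded_density(lo, hi, mu, sigma, N=1.0, r=0.5, s=1.0):
    """Return a density that is zero outside [lo, hi] and renormalized inside."""

    def kernel(x):
        return kotz_kernel(x, mu, sigma, N, r, s)

    # Kernel mass that falls inside the support region.
    mass = trapezoid(kernel, lo, hi)

    def density(x):
        if x < lo or x > hi:
            return 0.0  # indicator function of the support region
        return kernel(x) / mass

    return density


# Hypothetical component on the unit interval, with its mode pushed near the edge.
density = make_bounded_density(lo=0.0, hi=1.0, mu=0.8, sigma=0.3, N=2.0)
total = trapezoid(density, 0.0, 1.0)
print(round(total, 6))
```

In a mixture, each component would be bounded this way before being weighted and summed. In more than one dimension the normalizing integral over the support region generally has no closed form and would have to be approximated, for example by sampling.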
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.