{"title":"Standardization of expected value in gap statistic using Gaussian distribution for optimal number of clusters selection in K-means","authors":"Iliyas Karim Khan , Hanita Binti Daud , Nooraini Binti Zainuddin , Rajalingam Sokkalingam , Noor Naheed , Aftab Alam Janisar , Agha Inayat , Md Shohel Rana","doi":"10.1016/j.eij.2025.100701","DOIUrl":null,"url":null,"abstract":"<div><div>K-means clustering is a widely used unsupervised learning technique for partitioning data into distinct groups. However, determining the Optimal Number of Clusters (ONC) remains a significant challenge due to the subjective nature of existing methods. The Gap Statistic is a common approach for ONC selection, yet it has limitations across various data scenarios. To address these challenges, this paper introduces the Enhanced Gap Statistic (EGS), which improves the traditional Gap Statistic by incorporating a Gaussian distribution to standardize reference data and integrating an adjustment factor to enhance ONC selection accuracy. In this study, we apply the Gaussian distribution to generate the reference dataset in the Gap Statistic due to its stability, efficiency, and robustness in handling outliers. While Gaussian assumptions work well in many cases, we acknowledge that they may not always be suitable, particularly for skewed, heavy-tailed, or multimodal data. In such scenarios, alternative approaches, such as t-distribution and kernel density estimation, may provide better adaptability. Furthermore, we recognize that the computational complexity of incorporating Gaussian standardization could impact scalability for large datasets, necessitating further optimizations. To evaluate EGS, we compare its performance against widely used clustering validation metrics, including the Davies-Bouldin index, Calinski-Harabasz index, Silhouette index, Elbow curve, and the conventional Gap Statistic. Experimental results demonstrate that EGS consistently outperforms traditional methods in both accuracy and computational efficiency. Specifically, EGS achieved efficiency values of 0.0500, 0.21, 0.12, 3.340 and 4.34 s and accuracy values of 89.35 %, 95.35 %, 80.35 %, 74.3 % and 97.3 for Time Series, Well Log, Hitter, the large-scale Traffic Crash dataset and the high dimensional Darwin Dataset, respectively. The findings shows that EGS as a highly effective and computationally efficient method for ONC selection, making it a valuable tool for complex and large-scale data environments.</div></div>","PeriodicalId":56010,"journal":{"name":"Egyptian Informatics Journal","volume":"30 ","pages":"Article 100701"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Egyptian Informatics Journal","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1110866525000945","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
K-means clustering is a widely used unsupervised learning technique for partitioning data into distinct groups. However, determining the Optimal Number of Clusters (ONC) remains a significant challenge due to the subjective nature of existing methods. The Gap Statistic is a common approach for ONC selection, yet it has limitations across various data scenarios. To address these challenges, this paper introduces the Enhanced Gap Statistic (EGS), which improves the traditional Gap Statistic by incorporating a Gaussian distribution to standardize reference data and integrating an adjustment factor to enhance ONC selection accuracy. In this study, we apply the Gaussian distribution to generate the reference dataset in the Gap Statistic due to its stability, efficiency, and robustness in handling outliers. While Gaussian assumptions work well in many cases, we acknowledge that they may not always be suitable, particularly for skewed, heavy-tailed, or multimodal data. In such scenarios, alternative approaches, such as the t-distribution and kernel density estimation, may provide better adaptability. Furthermore, we recognize that the computational complexity of incorporating Gaussian standardization could impact scalability for large datasets, necessitating further optimizations. To evaluate EGS, we compare its performance against widely used clustering validation metrics, including the Davies-Bouldin index, Calinski-Harabasz index, Silhouette index, Elbow curve, and the conventional Gap Statistic. Experimental results demonstrate that EGS consistently outperforms traditional methods in both accuracy and computational efficiency. Specifically, EGS achieved efficiency values of 0.0500, 0.21, 0.12, 3.340, and 4.34 s and accuracy values of 89.35 %, 95.35 %, 80.35 %, 74.3 %, and 97.3 % for Time Series, Well Log, Hitter, the large-scale Traffic Crash dataset, and the high-dimensional Darwin dataset, respectively. The findings show that EGS is a highly effective and computationally efficient method for ONC selection, making it a valuable tool for complex and large-scale data environments.
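The abstract does not detail the EGS adjustment factor, so the sketch below only illustrates the underlying idea it builds on: a Gap Statistic in which the reference datasets are drawn from a Gaussian distribution fitted to the data (per-feature mean and standard deviation) rather than the usual uniform bounding box. It assumes numpy and scikit-learn; the function names (`within_dispersion`, `gap_statistic_gaussian`) and parameter choices are illustrative, not the authors' implementation.

```python
# Minimal sketch: Gap Statistic with a Gaussian reference distribution.
# Assumption: the paper's EGS adjustment factor is not reproduced here.
import numpy as np
from sklearn.cluster import KMeans


def within_dispersion(X, k, random_state=0):
    """Log of the within-cluster sum of squares (inertia) for a k-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return np.log(km.inertia_)


def gap_statistic_gaussian(X, k_max=10, n_refs=10, random_state=0):
    """Return the k in 1..k_max with the largest gap value.

    Reference datasets are sampled from a per-feature Gaussian fitted to X,
    instead of the uniform bounding box used in the conventional Gap Statistic.
    """
    rng = np.random.default_rng(random_state)
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_wk = within_dispersion(X, k, random_state)
        # Estimate the expected log-dispersion under the Gaussian reference.
        ref_log_wks = [
            within_dispersion(rng.normal(mu, sigma, size=X.shape), k, random_state)
            for _ in range(n_refs)
        ]
        gaps.append(np.mean(ref_log_wks) - log_wk)
    return int(np.argmax(gaps)) + 1


if __name__ == "__main__":
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    print("estimated number of clusters:", gap_statistic_gaussian(X, k_max=8))
```

In this sketch the chosen k is simply the one maximizing the gap; the original Tibshirani criterion instead picks the smallest k whose gap exceeds the next gap minus its standard error, and the paper's standardization and adjustment factor would modify the reference expectation further.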
Journal introduction:
The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The Journal provides a forum for state-of-the-art research and development in the fields of computing, including computer sciences, information technologies, information systems, operations research, and decision support. Submission of innovative, previously unpublished work in subjects covered by the Journal is encouraged, whether from academic, research, or commercial sources.