{"title":"Standardization of expected value in gap statistic using Gaussian distribution for optimal number of clusters selection in K-means","authors":"Iliyas Karim Khan , Hanita Binti Daud , Nooraini Binti Zainuddin , Rajalingam Sokkalingam , Noor Naheed , Aftab Alam Janisar , Agha Inayat , Md Shohel Rana","doi":"10.1016/j.eij.2025.100701","DOIUrl":null,"url":null,"abstract":"<div><div>K-means clustering is a widely used unsupervised learning technique for partitioning data into distinct groups. However, determining the Optimal Number of Clusters (ONC) remains a significant challenge due to the subjective nature of existing methods. The Gap Statistic is a common approach for ONC selection, yet it has limitations across various data scenarios. To address these challenges, this paper introduces the Enhanced Gap Statistic (EGS), which improves the traditional Gap Statistic by incorporating a Gaussian distribution to standardize reference data and integrating an adjustment factor to enhance ONC selection accuracy. In this study, we apply the Gaussian distribution to generate the reference dataset in the Gap Statistic due to its stability, efficiency, and robustness in handling outliers. While Gaussian assumptions work well in many cases, we acknowledge that they may not always be suitable, particularly for skewed, heavy-tailed, or multimodal data. In such scenarios, alternative approaches, such as t-distribution and kernel density estimation, may provide better adaptability. Furthermore, we recognize that the computational complexity of incorporating Gaussian standardization could impact scalability for large datasets, necessitating further optimizations. To evaluate EGS, we compare its performance against widely used clustering validation metrics, including the Davies-Bouldin index, Calinski-Harabasz index, Silhouette index, Elbow curve, and the conventional Gap Statistic. Experimental results demonstrate that EGS consistently outperforms traditional methods in both accuracy and computational efficiency. Specifically, EGS achieved efficiency values of 0.0500, 0.21, 0.12, 3.340 and 4.34 s and accuracy values of 89.35 %, 95.35 %, 80.35 %, 74.3 % and 97.3 for Time Series, Well Log, Hitter, the large-scale Traffic Crash dataset and the high dimensional Darwin Dataset, respectively. The findings shows that EGS as a highly effective and computationally efficient method for ONC selection, making it a valuable tool for complex and large-scale data environments.</div></div>","PeriodicalId":56010,"journal":{"name":"Egyptian Informatics Journal","volume":"30 ","pages":"Article 100701"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Egyptian Informatics Journal","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1110866525000945","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
K-means clustering is a widely used unsupervised learning technique for partitioning data into distinct groups. However, determining the Optimal Number of Clusters (ONC) remains a significant challenge due to the subjective nature of existing methods. The Gap Statistic is a common approach for ONC selection, yet it has limitations across various data scenarios. To address these challenges, this paper introduces the Enhanced Gap Statistic (EGS), which improves the traditional Gap Statistic by incorporating a Gaussian distribution to standardize reference data and integrating an adjustment factor to enhance ONC selection accuracy. In this study, we apply the Gaussian distribution to generate the reference dataset in the Gap Statistic due to its stability, efficiency, and robustness in handling outliers. While Gaussian assumptions work well in many cases, we acknowledge that they may not always be suitable, particularly for skewed, heavy-tailed, or multimodal data. In such scenarios, alternative approaches, such as the t-distribution and kernel density estimation, may provide better adaptability. Furthermore, we recognize that the computational complexity of incorporating Gaussian standardization could impact scalability for large datasets, necessitating further optimizations. To evaluate EGS, we compare its performance against widely used clustering validation metrics, including the Davies-Bouldin index, Calinski-Harabasz index, Silhouette index, Elbow curve, and the conventional Gap Statistic. Experimental results demonstrate that EGS consistently outperforms traditional methods in both accuracy and computational efficiency. Specifically, EGS achieved efficiency values of 0.0500, 0.21, 0.12, 3.340, and 4.34 s and accuracy values of 89.35 %, 95.35 %, 80.35 %, 74.3 %, and 97.3 % for Time Series, Well Log, Hitter, the large-scale Traffic Crash dataset, and the high-dimensional Darwin dataset, respectively. The findings show that EGS is a highly effective and computationally efficient method for ONC selection, making it a valuable tool for complex and large-scale data environments.
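The abstract does not detail the EGS adjustment factor, so the sketch below only illustrates the underlying idea it builds on: a Gap Statistic in which the reference datasets are drawn from a Gaussian distribution fitted to the data (per-feature mean and standard deviation) rather than the usual uniform bounding box. It assumes numpy and scikit-learn; the function names (`within_dispersion`, `gap_statistic_gaussian`) and parameter choices are illustrative, not the authors' implementation.

```python
# Minimal sketch: Gap Statistic with a Gaussian reference distribution.
# Assumption: the paper's EGS adjustment factor is not reproduced here.
import numpy as np
from sklearn.cluster import KMeans


def within_dispersion(X, k, random_state=0):
    """Log of the within-cluster sum of squares (inertia) for a k-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return np.log(km.inertia_)


def gap_statistic_gaussian(X, k_max=10, n_refs=10, random_state=0):
    """Return the k in 1..k_max with the largest gap value.

    Reference datasets are sampled from a per-feature Gaussian fitted to X,
    instead of the uniform bounding box used in the conventional Gap Statistic.
    """
    rng = np.random.default_rng(random_state)
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_wk = within_dispersion(X, k, random_state)
        # Estimate the expected log-dispersion under the Gaussian reference.
        ref_log_wks = [
            within_dispersion(rng.normal(mu, sigma, size=X.shape), k, random_state)
            for _ in range(n_refs)
        ]
        gaps.append(np.mean(ref_log_wks) - log_wk)
    return int(np.argmax(gaps)) + 1


if __name__ == "__main__":
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    print("estimated number of clusters:", gap_statistic_gaussian(X, k_max=8))
```

In this sketch the chosen k is simply the one maximizing the gap; the original Tibshirani criterion instead picks the smallest k whose gap exceeds the next gap minus its standard error, and the paper's standardization and adjustment factor would modify the reference expectation further.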
Journal introduction:
The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The Journal provides a forum for state-of-the-art research and development in the fields of computing, including computer sciences, information technologies, information systems, operations research, and decision support. Submission of innovative, previously unpublished work in subjects covered by the Journal is encouraged, whether from academic, research, or commercial sources.