NDPD：一种用于大数据挖掘的改进的分区聚类初始质心方法

IF 2.6 Q3 MANAGEMENT

Journal of Advances in Management Research Pub Date : 2022-08-23 DOI:10.1108/jamr-07-2021-0242

K. Pandey, D. Shukla

{"title":"NDPD：一种用于大数据挖掘的改进的分区聚类初始质心方法","authors":"K. Pandey, D. Shukla","doi":"10.1108/jamr-07-2021-0242","DOIUrl":null,"url":null,"abstract":"PurposeThe K-means (KM) clustering algorithm is extremely responsive to the selection of initial centroids since the initial centroid of clusters determines computational effectiveness, efficiency and local optima issues. Numerous initialization strategies are to overcome these problems through the random and deterministic selection of initial centroids. The random initialization strategy suffers from local optimization issues with the worst clustering performance, while the deterministic initialization strategy achieves high computational cost. Big data clustering aims to reduce computation costs and improve cluster efficiency. The objective of this study is to achieve a better initial centroid for big data clustering on business management data without using random and deterministic initialization that avoids local optima and improves clustering efficiency with effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.Design/methodology/approachThis study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm resolves the KM clustering problem by probability density of each data point. The NDPDKM algorithm first identifies the most probable density data points by using the mean and standard deviation of the datasets through normal probability density. Thereafter, the NDPDKM determines K initial centroid by using sorting and linear systematic sampling heuristics.FindingsThe performance of the proposed algorithm is compared with KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms through Davies Bouldin score, Silhouette coefficient, SD Validity, S_Dbw Validity, Number of Iterations and CPU time validation indices on eight real business datasets. The experimental evaluation demonstrates that the NDPDKM algorithm reduces iterations, local optima, computing costs, and improves cluster performance, effectiveness, efficiency with stable convergence as compared to other algorithms. The NDPDKM algorithm minimizes the average computing time up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and reduces the average iterations up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74% with reference to KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms.Originality/valueThe KM algorithm is the most widely used partitional clustering approach in data mining techniques that extract hidden knowledge, patterns and trends for decision-making strategies in business data. Business analytics is one of the applications of big data clustering where KM clustering is useful for the various subcategories of business analytics such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching and other decision-making strategies in business.","PeriodicalId":46158,"journal":{"name":"Journal of Advances in Management Research","volume":" ","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2022-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"NDPD: an improved initial centroid method of partitional clustering for big data mining\",\"authors\":\"K. Pandey, D. Shukla\",\"doi\":\"10.1108/jamr-07-2021-0242\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"PurposeThe K-means (KM) clustering algorithm is extremely responsive to the selection of initial centroids since the initial centroid of clusters determines computational effectiveness, efficiency and local optima issues. Numerous initialization strategies are to overcome these problems through the random and deterministic selection of initial centroids. The random initialization strategy suffers from local optimization issues with the worst clustering performance, while the deterministic initialization strategy achieves high computational cost. Big data clustering aims to reduce computation costs and improve cluster efficiency. The objective of this study is to achieve a better initial centroid for big data clustering on business management data without using random and deterministic initialization that avoids local optima and improves clustering efficiency with effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.Design/methodology/approachThis study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm resolves the KM clustering problem by probability density of each data point. The NDPDKM algorithm first identifies the most probable density data points by using the mean and standard deviation of the datasets through normal probability density. Thereafter, the NDPDKM determines K initial centroid by using sorting and linear systematic sampling heuristics.FindingsThe performance of the proposed algorithm is compared with KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms through Davies Bouldin score, Silhouette coefficient, SD Validity, S_Dbw Validity, Number of Iterations and CPU time validation indices on eight real business datasets. The experimental evaluation demonstrates that the NDPDKM algorithm reduces iterations, local optima, computing costs, and improves cluster performance, effectiveness, efficiency with stable convergence as compared to other algorithms. The NDPDKM algorithm minimizes the average computing time up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and reduces the average iterations up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74% with reference to KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms.Originality/valueThe KM algorithm is the most widely used partitional clustering approach in data mining techniques that extract hidden knowledge, patterns and trends for decision-making strategies in business data. Business analytics is one of the applications of big data clustering where KM clustering is useful for the various subcategories of business analytics such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching and other decision-making strategies in business.\",\"PeriodicalId\":46158,\"journal\":{\"name\":\"Journal of Advances in Management Research\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2022-08-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Advances in Management Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1108/jamr-07-2021-0242\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"MANAGEMENT\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Advances in Management Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1108/jamr-07-2021-0242","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MANAGEMENT","Score":null,"Total":0}

引用次数: 0

摘要

目的K-means（KM）聚类算法对初始质心的选择非常敏感，因为聚类的初始质心决定了计算的有效性、效率和局部最优问题。许多初始化策略都是通过随机和确定性地选择初始质心来克服这些问题。随机初始化策略存在聚类性能最差的局部优化问题，而确定性初始化策略的计算成本很高。大数据集群旨在降低计算成本，提高集群效率。本研究的目的是在不使用随机和确定性初始化的情况下，在商业管理数据上实现更好的大数据聚类初始质心，避免局部最优，并在聚类质量、计算成本、数据比较和单机迭代方面有效地提高聚类效率。设计/方法论/方法本研究提出了用于单机上大数据聚类的正态分布概率密度（NDPD）算法，以解决与业务管理相关的聚类问题。NDPDKM算法通过每个数据点的概率密度来解决KM聚类问题。NDPDKM算法首先通过使用数据集通过正态概率密度的平均值和标准差来识别最可能的密度数据点。此后，NDPDKM通过使用排序和线性系统采样启发法来确定K个初始质心。通过Davies-Bouldin评分、Silhouette系数、SD有效性、S_Dbw有效性、迭代次数和CPU时间验证指标，将该算法的性能与KM、KM++、Var-Part、Murat-KM、Mean-KM和Sort-KM算法进行了比较。实验评估表明，与其他算法相比，NDPDKM算法降低了迭代次数、局部最优值和计算成本，并以稳定的收敛性提高了聚类性能、有效性和效率。NDPDKM算法将平均计算时间最小化至34.83%、90.28%、71.83%、92.67%、69.53%和76.03%，并将平均迭代次数减少至40.32%、44.06%、32.02%、62.78%、19.07%和36.74%。独创性/价值KM算法是数据挖掘技术中使用最广泛的部分聚类方法，用于提取商业数据中决策策略的隐藏知识、模式和趋势。业务分析是大数据集群的应用之一，其中KM集群可用于业务分析的各个子类别，如客户细分分析、员工薪酬和绩效分析、文档搜索、交付优化、折扣和优惠分析、牧师管理、制造分析、生产力分析、，专业的员工和投资者搜索以及其他商业决策策略。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

NDPD: an improved initial centroid method of partitional clustering for big data mining

PurposeThe K-means (KM) clustering algorithm is extremely responsive to the selection of initial centroids since the initial centroid of clusters determines computational effectiveness, efficiency and local optima issues. Numerous initialization strategies are to overcome these problems through the random and deterministic selection of initial centroids. The random initialization strategy suffers from local optimization issues with the worst clustering performance, while the deterministic initialization strategy achieves high computational cost. Big data clustering aims to reduce computation costs and improve cluster efficiency. The objective of this study is to achieve a better initial centroid for big data clustering on business management data without using random and deterministic initialization that avoids local optima and improves clustering efficiency with effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.Design/methodology/approachThis study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm resolves the KM clustering problem by probability density of each data point. The NDPDKM algorithm first identifies the most probable density data points by using the mean and standard deviation of the datasets through normal probability density. Thereafter, the NDPDKM determines K initial centroid by using sorting and linear systematic sampling heuristics.FindingsThe performance of the proposed algorithm is compared with KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms through Davies Bouldin score, Silhouette coefficient, SD Validity, S_Dbw Validity, Number of Iterations and CPU time validation indices on eight real business datasets. The experimental evaluation demonstrates that the NDPDKM algorithm reduces iterations, local optima, computing costs, and improves cluster performance, effectiveness, efficiency with stable convergence as compared to other algorithms. The NDPDKM algorithm minimizes the average computing time up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and reduces the average iterations up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74% with reference to KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms.Originality/valueThe KM algorithm is the most widely used partitional clustering approach in data mining techniques that extract hidden knowledge, patterns and trends for decision-making strategies in business data. Business analytics is one of the applications of big data clustering where KM clustering is useful for the various subcategories of business analytics such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching and other decision-making strategies in business.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Advances in Management Research MANAGEMENT-

CiteScore

6.50

自引率

3.20%

发文量