Determination of Initial Centers in K-Means Clustering Method by NAMGY Algorithm

Meryem Goral Yildizli
International Journal of Scientific and Technological Research, published 2021-01-01. DOI: 10.7176/jstr/7-01-05 (https://doi.org/10.7176/jstr/7-01-05)

Abstract

Objective: With the development and widespread use of technology, the growing volume of data in many areas has accelerated digitization. Processing and interpreting this accumulated data can contribute significantly to institutions and organizations in many managerial matters, from production to decision-making. This has increased the use, across different areas, of data mining methods that support the transformation of large-scale digitized data into information. Clustering is one of the increasingly popular data mining techniques, and the K-means algorithm is a non-hierarchical clustering method suited to large amounts of data. The method is widely used in scientific studies; however, the fact that the number of clusters and the initial centers must be supplied as parameters is a disadvantage, especially for users unfamiliar with the mathematical details. K-means is very sensitive to its initial centers, and randomly generated initial centers often lead to non-optimal clustering results. More consistent results can be obtained by running K-means more than once, but it is difficult to decide how many runs are needed to reach the optimal result. An improvement to K-means in this respect would help overcome this disadvantage in scientific studies. To address this problem, the NAMGY (Neighborhood and Midpoint Gain Yield) algorithm was developed; it includes methods that provide optimal selection of parameters according to the properties of the objects. This article covers the application of the NAMGY method for determining the initial centers.

Method: To analyze the accuracy of the proposed method, both standard K-means and the NAMGY algorithm were applied to the labeled data sets Iris, Yeast and Segment-challenge. The performance of the algorithms in terms of their working principles was also evaluated on the VitaminB12 data set obtained from the Cukurova University Balcalı Hospital Information Management System. Euclidean distances were calculated between objects, and the data sets were normalized to values in the range [0, 1]. The adjusted Rand index was used to evaluate the validity of the clusterings.

Results: The effects of the initial centers on the algorithms' analysis process were examined from several perspectives: the working principle of the algorithm, the effect of the initial centers on the clustering results, and the evaluation of clustering performance. It was again concluded that careful, informed selection of parameters is required to increase the usability of a clustering algorithm and the reliability of its results. The NAMGY algorithm finds initial centers systematically, which reduces the number of data set scans and yields better accuracy in fewer iterations. NAMGY proved to be better than the traditional K-means algorithm in terms of result quality and analysis process. According to these results, NAMGY offers a competitive alternative that addresses this disadvantage of the standard K-means algorithm. However, further research is required to verify the capability of the algorithm on data sets with more complex objects.
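
The evaluation setup described under Method (normalization to [0, 1], Euclidean distances, and scoring with the adjusted Rand index) can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not describe how NAMGY selects its centers, so the sketch simply takes the initial centers as an argument and runs standard K-means from them. The function names `min_max_normalize`, `kmeans` and `adjusted_rand_index` are hypothetical and chosen for illustration only.

```python
from math import comb, dist


def min_max_normalize(rows):
    """Scale every column to the range [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        tuple((v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans))
        for row in rows
    ]


def kmeans(points, centers, max_iter=100):
    """Standard (Lloyd's) K-means started from caller-supplied initial centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same objects."""
    cells, rows, cols = {}, {}, {}
    for t, p in zip(truth, pred):
        cells[(t, p)] = cells.get((t, p), 0) + 1
        rows[t] = rows.get(t, 0) + 1
        cols[p] = cols.get(p, 0) + 1
    total = comb(len(truth), 2)
    if total == 0:
        return 1.0
    sum_cells = sum(comb(n, 2) for n in cells.values())
    sum_rows = sum(comb(n, 2) for n in rows.values())
    sum_cols = sum(comb(n, 2) for n in cols.values())
    expected = sum_rows * sum_cols / total
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    raw = [(1.0, 10.0), (2.0, 12.0), (9.0, 48.0), (10.0, 50.0)]  # toy data
    truth = [0, 0, 1, 1]
    data = min_max_normalize(raw)
    labels, _ = kmeans(data, centers=[data[0], data[3]])
    print(labels, adjusted_rand_index(truth, labels))
```

Calling `kmeans` with different `centers` on the same data is a simple way to observe the sensitivity to initialization that the abstract discusses; the adjusted Rand index then compares each resulting partition against the known class labels.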