A Multicluster Approach to Selecting Initial Sets for Clustering of Categorical Data

Q2 Computer Science

Interdisciplinary Journal of Information, Knowledge, and Management Pub Date : 2020-10-04 DOI:10.28945/4643

Carlos Santos-Mangudo, Antonio J. Heras

Volume 15, pages 227-246. Publication type: Journal Article.

Citations: 1

Abstract

Aim/Purpose: This article proposes a methodology for selecting the initial sets for clustering categorical data. The main idea is to combine all the different values of every criterion or attribute to form a first proposal of so-called multiclusters, which yields the maximum number of clusters for the whole dataset. The multiclusters thus obtained are themselves clustered in a second step, according to the desired final number of clusters.

Background: Popular clustering methods for categorical data, such as the well-known K-Modes, usually select the initial sets by some random process, which introduces randomness into the final results of the algorithms. We explore a different application of the clustering methodology for categorical data that overcomes these instability problems and ultimately provides greater clustering efficiency.

Methodology: To assess the performance of the proposed algorithm and compare it with K-Modes, we apply both to categorical databases where the response variable is known but not used in the analysis. In our examples, that response variable can be identified with the real clusters or classes to which the observations belong. For every data set, we perform a two-step analysis. In the first step, we run the clustering analysis on data from which the response variable (the real clusters) has been omitted. In the second step, we use that omitted information to check the efficiency of the clustering algorithm by comparing the real clusters with those given by the algorithm.

Contribution: Simplicity, efficiency, and stability are the main advantages of the multicluster method.

Findings: The experimental results obtained with real databases show that the multicluster algorithm has greater precision and a better grouping effect than the classical K-Modes algorithm.

Recommendations for Practitioners: The method can be useful for researchers working with small and medium-sized datasets, allowing them to detect the underlying structure of the data in an intuitive and reasonable way.

Recommendations for Researchers: The proposed algorithm is slower than K-Modes, since it devotes a lot of time to calculating the initial combinations of attributes. Reducing the computing time is therefore an important research topic.

Future Research: We are concerned with the scalability of the algorithm to large and complex data sets, as well as its application to mixed data sets with both quantitative and qualitative attributes.
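The two ideas in the abstract that lend themselves to code are the first step (forming multiclusters from combinations of attribute values) and the evaluation against the withheld response variable. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it reads "multicluster" as the set of observations sharing one observed combination of attribute values, and it uses purity as the comparison measure, since the abstract does not name the metric. The function names `build_multiclusters` and `purity` and the toy data are hypothetical. The second step, which clusters the multiclusters down to the desired number, is not specified in the abstract and is left out.

```python
from collections import Counter, defaultdict


def build_multiclusters(rows):
    """Group row indices by their exact combination of attribute values.

    Assumption: one multicluster per distinct observed combination.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[tuple(row)].append(index)
    return dict(groups)


def purity(predicted, actual):
    """Share of observations that belong to the majority true class of their cluster.

    Assumption: purity stands in for the paper's (unstated) comparison measure.
    """
    classes_by_cluster = defaultdict(list)
    for cluster_id, true_class in zip(predicted, actual):
        classes_by_cluster[cluster_id].append(true_class)
    majority_total = sum(
        Counter(classes).most_common(1)[0][1]
        for classes in classes_by_cluster.values()
    )
    return majority_total / len(actual)


# Toy data (hypothetical): three categorical attributes per observation.
rows = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("red", "small", "round"),
]
multiclusters = build_multiclusters(rows)

# Hypothetical cluster assignments and withheld true classes for the same rows.
predicted = [0, 0, 1, 1, 0]
actual = ["a", "a", "b", "b", "b"]
score = purity(predicted, actual)
```

Because the grouping depends only on the data and not on a random draw, repeated runs yield the same multiclusters, which is the stability property the abstract contrasts with randomly initialized K-Modes.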


Source journal

Interdisciplinary Journal of Information, Knowledge, and Management Computer Science-Computer Science (all)

CiteScore

2.30

Self-citation rate

0.00%

Articles published