A Multicluster Approach to Selecting Initial Sets for Clustering of Categorical Data

Authors: Carlos Santos-Mangudo, Antonio J. Heras
Journal: Interdisciplinary Journal of Information, Knowledge, and Management, vol. 15, pp. 227-246
Published: 2020-10-04
DOI: 10.28945/4643 (https://doi.org/10.28945/4643)
Aim/Purpose: This article proposes a methodology for selecting the initial sets for clustering categorical data. The main idea is to combine all the different values of every criterion or attribute to form a first proposal of so-called multiclusters, thereby obtaining the maximum number of clusters for the whole dataset. The multiclusters thus obtained are themselves clustered in a second step, according to the desired final number of clusters.

Background: Popular clustering methods for categorical data, such as the well-known K-Modes, usually select the initial sets by some random process, which introduces randomness into the final results of the algorithms. We explore a different application of the clustering methodology for categorical data that overcomes these instability problems and ultimately provides greater clustering efficiency.

Methodology: To assess the performance of the proposed algorithm and compare it with K-Modes, we apply both to categorical databases where the response variable is known but not used in the analysis. In our examples, that response variable can be identified with the real clusters or classes to which the observations belong. With every data set, we perform a two-step analysis. In the first step, we run the clustering analysis on data from which the response variable (the real clusters) has been omitted. In the second step, we use that omitted information to check the efficiency of the clustering algorithm by comparing the real clusters with those given by the algorithm.

Contribution: Simplicity, efficiency, and stability are the main advantages of the multicluster method.

Findings: The experimental results obtained with real databases show that the multicluster algorithm has greater precision and a better grouping effect than the classical K-Modes algorithm.

Recommendations for Practitioners: The method can be useful for researchers working with small and medium-sized datasets, allowing them to detect the underlying structure of the data in an intuitive and reasonable way.

Recommendations for Researchers: The proposed algorithm is slower than K-Modes, since it devotes a lot of time to calculating the initial combinations of attributes. Reducing the computing time is therefore an important research topic.

Future Research: We are concerned with the scalability of the algorithm to large and complex data sets, as well as its application to mixed data sets with both quantitative and qualitative attributes.
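
The first step described under Aim/Purpose and the check described under Methodology can be illustrated with a minimal sketch. It rests on two assumptions that the abstract does not confirm: that a multicluster is the group of observations sharing exactly the same combination of attribute values, and that agreement with the real clusters is measured by purity (the share of observations falling in the majority class of their assigned cluster). The function names `build_multiclusters` and `purity` are hypothetical, and the second step, in which the multiclusters are themselves clustered, is not sketched because the abstract does not specify how it is done.

```python
from collections import Counter, defaultdict


def build_multiclusters(rows):
    """Group row indices by their exact combination of attribute values.

    Assumption: one multicluster per distinct combination, which gives the
    largest number of groups the attributes can distinguish.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[tuple(row)].append(index)
    return dict(groups)


def purity(assigned, actual):
    """Share of observations that belong to the majority class of their cluster.

    Assumption: purity stands in for the comparison between real and
    computed clusters; the abstract does not name a specific measure.
    """
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(assigned, actual):
        labels_by_cluster[cluster].append(label)
    matched = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return matched / len(actual)


# Toy categorical data: two attributes, four observations.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
    ("blue", "small"),
]
multiclusters = build_multiclusters(rows)
print(len(multiclusters))  # 3 distinct combinations

# Compare hypothetical cluster assignments with the held-out real classes.
print(purity([0, 0, 1, 1], ["a", "a", "b", "a"]))  # 0.75
```

Because the grouping depends only on the data, repeated runs give the same initial sets, which is the kind of stability the abstract contrasts with the random initialization of K-Modes.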