I. Jarman, T. Etchells, P. Lisboa, Charlene Beynon, J. Martín-Guerrero
Title: Clustering categorical data: A stability analysis framework
Published in: 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 2011-04-11
DOI: 10.1109/CIDM.2011.5949452
Citations: 3
Abstract
Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but it is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which makes it difficult to select the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation of the mean. This is often the case with real-world datasets, for instance in the domain of Public Health, resulting in solutions that can be radically different depending on the initialization and therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initialization using a generic landscape mapping of k-modes solutions. The second uses the landscape map to stabilize the partition clusters for discrete data by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset, and a case study involving Public Health data.
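
As a rough illustration of the initialization sensitivity described above, the following is a minimal k-modes sketch in Python. It is not the authors' landscape mapping or consensus-sampling procedure, which the abstract does not specify. The function names (`hamming`, `column_modes`, `k_modes`), the simple-matching dissimilarity, the random-record initialization, and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter


def hamming(a, b):
    """Simple-matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category in each column (ties go to the value seen first)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, seed, max_iter=100):
    """One k-modes run from a random initialization; returns (labels, modes, cost)."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)  # assumption: prototypes start as k random records
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest mode.
        new_labels = [min(range(k), key=lambda j: hamming(row, modes[j])) for row in data]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each prototype becomes the column-wise mode of its members.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    cost = sum(hamming(row, modes[lab]) for row, lab in zip(data, labels))
    return labels, modes, cost


# Hypothetical toy data: six records with three categorical attributes each.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]

# Repeat the run from several initializations and compare the outcomes.
runs = [k_modes(data, k=2, seed=s) for s in range(20)]
best = min(runs, key=lambda r: r[2])
# Label vectors that differ only by a permutation of cluster ids count as distinct here.
distinct_label_vectors = {tuple(r[0]) for r in runs}
print("lowest cost:", best[2], "| distinct label vectors:", len(distinct_label_vectors))
```

Keeping the run with the lowest total within-cluster dissimilarity is only one common heuristic for choosing among initializations. The paper's framework instead examines the full set of solutions, so this sketch should be read as background for the problem rather than as the proposed method.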