Clustering categorical data: A stability analysis framework

2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) Pub Date : 2011-04-11 DOI:10.1109/CIDM.2011.5949452

I. Jarman, T. Etchells, P. Lisboa, Charlene Beynon, J. Martín-Guerrero


Citations: 3

Abstract

Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but it is not generally appropriate for categorical data. The k-modes algorithm is an extension of k-means designed for categorical data. Both of these partition clustering methods are sensitive to the initialization of prototypes, which makes it difficult to select the best solution for a given problem. Selecting the number of clusters can also be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation of the mean. Noisy data are common in real-world datasets, for instance in the domain of Public Health, and the resulting solutions can differ radically depending on the initialization and therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initialization using a generic landscape mapping of k-modes solutions. The second uses the landscape map to stabilize the partition clusters for discrete data by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset, and a case study involving Public Health data.
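The sensitivity to initialization described in the abstract can be illustrated with a minimal sketch of plain k-modes, assuming Hamming distance as the dissimilarity and random selection of initial prototypes. This is not the authors' landscape-mapping or consensus procedure, which the abstract does not specify; the function names, the toy dataset, and the noise level are all hypothetical choices made for illustration.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value of each attribute (the cluster prototype)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(data, k, seed, max_iter=100):
    """Plain k-modes; returns (labels, modes, total within-cluster cost)."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)  # random initial prototypes (assumption)
    labels = None
    for _ in range(max_iter):
        # Assign every record to its nearest mode.
        new_labels = [min(range(k), key=lambda j: hamming(row, modes[j]))
                      for row in data]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each mode from its current members.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    cost = sum(hamming(row, modes[lab]) for row, lab in zip(data, labels))
    return labels, modes, cost

def toy_data(n_per_cluster=30, noise=0.2, seed=0):
    """Hypothetical noisy dataset: three templates with random corruption."""
    rng = random.Random(seed)
    templates = [("a",) * 4, ("b",) * 4, ("c",) * 4]
    data = []
    for template in templates:
        for _ in range(n_per_cluster):
            data.append(tuple(rng.choice("abc") if rng.random() < noise else v
                              for v in template))
    return data

data = toy_data()
# Run the same algorithm from 20 different initializations and tally the costs.
costs = [k_modes(data, k=3, seed=s)[2] for s in range(20)]
print(sorted(Counter(costs).items()))
```

Each printed pair is a final cost and the number of initializations that reached it. More than one distinct cost would mean that different starts converged to different partitions, which is the kind of spread a map of k-modes solutions would need to summarize.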
