A New Fast Minimum Spanning Tree-Based Clustering Technique

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI:10.1109/ICDMW.2014.139

Xiaochun Wang, X. Wang, Jihua Zhu


Citations: 6

Abstract

Because clustering has important applications in data mining, many clustering techniques have been developed. Today's real-world databases typically hold millions of items with many thousands of fields, producing datasets that reach terabytes in size. At this scale, many traditional clustering techniques are increasingly limited, and computationally efficient approaches have become increasingly popular. This paper proposes a new, efficient two-phase approach to graph-theoretical clustering based on a minimum spanning tree representation of a dataset. In the first phase, we modify the standard Prim's algorithm so that the tree can be constructed efficiently using k-nearest neighbor search mechanisms; during construction, a new edge weight is defined to maximize intra-cluster similarity and minimize inter-cluster similarity. In the second phase, based on the intuition that data points in the same cluster are closer to each other than to points in different clusters, the longest edges of the minimum spanning tree obtained in the first phase are removed to form clusters, as in standard minimum spanning tree-based clustering algorithms. Experiments on synthetic and real data sets show that the proposed approach performs well compared with state-of-the-art methods.
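The two phases described in the abstract can be illustrated with a minimal sketch of the standard minimum spanning tree-based clustering baseline. This is not the authors' method: it assumes plain Euclidean distance as the edge weight and uses an unmodified O(n^2) Prim's algorithm, without the k-nearest neighbor acceleration or the new edge weight, neither of which the abstract specifies. The function names `prim_mst` and `mst_clusters` and the parameter `k` (the desired number of clusters) are hypothetical.

```python
import math


def prim_mst(points):
    """Phase 1 (baseline): MST of the complete Euclidean graph.

    Standard Prim's algorithm, O(n^2); returns a list of (weight, u, v).
    Assumption: plain Euclidean distance, not the paper's edge weight.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        # Relax the connection cost of every vertex still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(points, k):
    """Phase 2: remove the k-1 longest MST edges; return one label per point."""
    n = len(points)
    edges = sorted(prim_mst(points))
    kept = edges[: max(0, len(edges) - (k - 1))]

    # Union-find over the kept edges yields the connected components.
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for _, u, v in kept:
        root[find(u)] = find(v)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Calling `mst_clusters(points, 2)` on two well-separated groups of points cuts the single long edge that bridges them. The quadratic cost of this baseline construction is the kind of bottleneck the paper's k-nearest neighbor-based construction is meant to avoid.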

