A Parallel Framework for Grid-Based Bottom-Up Subspace Clustering

2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). Pub Date: 2016-10-01. DOI: 10.1109/DSAA.2016.42

Poonam Goyal, S. Kumari, Shubham Singh, V. Kishore, S. Balasubramaniam, Navneet Goyal



Abstract

Clustering is a popular data mining and machine learning technique that discovers interesting patterns in unlabeled data by grouping similar objects together. Clustering high-dimensional data is challenging because points in high-dimensional space are nearly equidistant from each other, which renders commonly used similarity measures ineffective. Subspace clustering has emerged as a possible solution to this problem: instead of clustering in the full space, it looks for clusters in different subspaces of a dataset. Many subspace clustering algorithms have been proposed in the last two decades to find clusters in multiple overlapping subspaces of high-dimensional data. These algorithms iteratively find the best subset of dimensions for a cluster from the 2^d - 1 possible combinations in d-dimensional data. Subspace clustering is extremely compute-intensive because of this exhaustive search of subspaces, especially in bottom-up subspace clustering algorithms. To address this issue, an efficient parallel framework for grid-based bottom-up subspace clustering algorithms is developed, considering popular algorithms belonging to this category. The framework is implemented for shared memory, distributed memory, and hybrid systems and is tested with three grid-based bottom-up subspace clustering algorithms: CLIQUE, MAFIA, and ENCLUS. All parallel implementations exhibit impressive speedup and scalability on real datasets.
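To make the bottom-up search over the 2^d - 1 subspaces concrete (d = 20 already gives 1,048,575 candidates), the following is a minimal serial sketch of a CLIQUE-style grid search. It is an illustration under stated assumptions, not the framework described in the paper: the function name `dense_units`, the parameters `xi` (intervals per dimension) and `tau` (density threshold as a fraction of the points), the fixed value range, and the toy data are all hypothetical. It also prunes at the subspace level only, which is a simplification of the unit-level pruning a full implementation would typically use.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, xi=4, tau=0.3, lo=0.0, hi=1.0):
    """Return {subspace: set of dense grid cells}, searched bottom-up.

    Hypothetical helper: every dimension is assumed to lie in [lo, hi]
    and is cut into xi equal-width intervals; a cell is "dense" when it
    holds at least tau * len(points) points.
    """
    n, d = len(points), len(points[0])
    width = (hi - lo) / xi
    # Interval index of every point in every dimension.
    cells = [tuple(min(int((x - lo) / width), xi - 1) for x in p) for p in points]
    min_count = tau * n

    def count(dims):
        # Project the points onto `dims` and keep the cells that are dense.
        hist = Counter(tuple(cell[j] for j in dims) for cell in cells)
        return {unit for unit, k in hist.items() if k >= min_count}

    # Level 1: dense intervals in each single dimension.
    level = {}
    for j in range(d):
        units = count((j,))
        if units:
            level[(j,)] = units

    result = {}
    while level:
        result.update(level)
        next_level = {}
        # Join two k-dimensional subspaces that share their first k-1 dimensions.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] != b[:-1]:
                continue
            cand = a + (b[-1],)
            # Monotonicity: a dense (k+1)-dim cell needs dense k-dim projections,
            # so skip candidates with any k-dim subspace that has no dense cell.
            if all(s in level for s in combinations(cand, len(a))):
                units = count(cand)
                if units:
                    next_level[cand] = units
        level = next_level
    return result


# Toy data (made up): two groups that are tight in dimensions 0 and 1
# but spread evenly along dimension 2.
points = [
    (0.10, 0.10, 0.05), (0.12, 0.15, 0.30), (0.15, 0.12, 0.55), (0.20, 0.20, 0.80),
    (0.80, 0.85, 0.10), (0.85, 0.80, 0.35), (0.90, 0.90, 0.60), (0.88, 0.95, 0.90),
]
found = dense_units(points, xi=4, tau=0.3)

if __name__ == "__main__":
    for subspace in sorted(found):
        print(subspace, sorted(found[subspace]))
```

On this toy data, dimension 2 has no dense interval, so every subspace containing it is pruned without being counted. Each surviving candidate is counted independently of the others, which is one plausible reason this family of algorithms lends itself to parallelization; how the paper's framework actually distributes the work is not stated in the abstract.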
