Approximate Clustering Ensemble Method for Big Data

IF 7.5 | Zone 3 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data | Pub Date: 2023-03-10 | DOI: 10.1109/TBDATA.2023.3255003

Mohammad Sultan Mahmud; Joshua Zhexue Huang; Rukhsana Ruby; Alladoumbaye Ngueilbaye; Kaishun Wu

Volume 9, Issue 4, pp. 1142-1155. https://ieeexplore.ieee.org/document/10066202/

Citations: 2

Abstract

Clustering a big distributed dataset of hundreds of gigabytes or more is a challenging task in distributed computing. A popular way to tackle this problem is to use a random sample of the big dataset to compute an approximate result as an estimate of the true result computed from the entire dataset. In this paper, instead of using a single random sample, we use multiple random samples to compute an ensemble result as the estimate of the true result for the big dataset. We propose a distributed computing framework to compute the ensemble result. In this framework, a big dataset is represented in the RSP data model as random sample data blocks managed in a distributed file system. To compute the ensemble clustering result, a set of RSP data blocks is randomly selected as random samples and clustered independently in parallel on the nodes of a cluster to generate the component clustering results. The component results are transferred to the master node, which computes the ensemble result. Because the random samples are disjoint, traditional consensus functions cannot be used, so we propose two new methods to integrate the component clustering results into the final ensemble result. The first method uses the component cluster centers to build a graph and the METIS algorithm to cut the graph into subgraphs, from which a set of candidate cluster centers is found. A hierarchical clustering method is then used to generate the final set of $k$ cluster centers. The second method uses the clustering-by-passing-messages method to generate the final set of $k$ cluster centers. Finally, the $k$-means algorithm is used to allocate the entire dataset into $k$ clusters. Experiments were conducted on both synthetic and real-world datasets. The results show that the new ensemble clustering methods performed better than the comparison methods and that the distributed computing framework is efficient and scalable in clustering big datasets.
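The pipeline described in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation, and every function name in it is hypothetical. It makes three assumptions: a shuffle-and-split stands in for the RSP data model, a sequential loop stands in for the parallel cluster nodes, and a simple agglomerative merge of component centers stands in for both consensus methods (METIS plus hierarchical clustering, and clustering by passing messages). The final allocation is a single nearest-center assignment rather than a full $k$-means run.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda j: dist2(p, centers[j]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
    return centers

def merge_centers(centers, k):
    """ASSUMPTION: stand-in consensus step, not the paper's methods.
    Repeatedly merges the two groups of centers whose means are closest."""
    groups = [[c] for c in centers]
    while len(groups) > k:
        pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
        i, j = min(pairs, key=lambda ij: dist2(mean(groups[ij[0]]), mean(groups[ij[1]])))
        merged = groups.pop(j)  # j > i, so index i is unaffected
        groups[i].extend(merged)
    return [mean(g) for g in groups]

def approximate_ensemble(data, k, n_blocks, n_sampled, rng):
    """Cluster a few disjoint random blocks, merge their centers, label all data."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    # ASSUMPTION: disjoint shuffled slices mimic RSP data blocks.
    blocks = [shuffled[i::n_blocks] for i in range(n_blocks)]
    sampled = rng.sample(blocks, n_sampled)
    # Component clusterings (run in parallel on worker nodes in the paper).
    component_centers = [c for block in sampled for c in kmeans(block, k)]
    final_centers = merge_centers(component_centers, k)
    # Allocate every point of the full dataset to its nearest final center.
    labels = [nearest(p, final_centers) for p in data]
    return final_centers, labels

# Toy data: three well-separated 2-D Gaussian blobs, 400 points each.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in true_centers for _ in range(400)]
final_centers, labels = approximate_ensemble(data, k=3, n_blocks=12, n_sampled=4, rng=rng)
print(sorted((round(x, 2), round(y, 2)) for x, y in final_centers))
```

Only 4 of the 12 blocks are clustered here, yet the merged centers are used to label all 1,200 points, which is the sense in which the ensemble result approximates a clustering of the full dataset.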


Source journal

IEEE Transactions on Big Data Multiple-

CiteScore: 11.80

Self-citation rate: 2.80%

Articles published: 114

About the journal: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.