Parallel Hierarchical Clustering on Market Basket Data

2008 IEEE International Conference on Data Mining Workshops. Pub Date: 2008-12-15. DOI: 10.1109/ICDMW.2008.32

Baoying Wang, Qin Ding, Imad Rahal

Citations: 6

Abstract

Data clustering has proven to be a promising data mining technique, and there have recently been many attempts at clustering market-basket data. In this paper, we propose a parallelized hierarchical clustering approach for market-basket data (PH-Clustering), implemented using MPI. Based on an analysis of the major clustering steps, we adopt a partly local, partly global approach that decreases computation time while keeping communication time to a minimum. Load balancing is considered throughout, especially at the data partitioning stage. Our experimental results demonstrate that PH-Clustering is substantially faster than sequential clustering. With a large number of processors, the speedup becomes more significant as the data size grows. Our results also show that the number of items has more impact on the performance of PH-Clustering than the number of transactions.
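
The abstract does not give the algorithm's details, so the following is only a minimal single-process sketch of the general pattern it names: partition the transactions with load balance in mind, compute something locally on each partition, then combine the results globally. The choice of item-pair co-occurrence counts as the local quantity, the greedy partitioning heuristic, and all function names (`partition`, `local_pair_counts`, `global_merge`) are assumptions made for illustration, not the authors' PH-Clustering method. In an MPI program the final merge would be a collective reduction across processes; here it is simulated with a plain loop.

```python
from collections import Counter
from itertools import combinations


def partition(transactions, n_workers):
    """Greedy load balancing: give each basket to the least-loaded worker.

    Load is estimated as the number of item pairs in the basket, since
    pair counting grows quadratically with basket size (an assumption
    of this sketch, not a detail taken from the paper).
    """
    parts = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    # Place the most expensive baskets first so the greedy choice balances well.
    for basket in sorted(transactions, key=len, reverse=True):
        target = loads.index(min(loads))
        parts[target].append(basket)
        loads[target] += len(basket) * (len(basket) - 1) // 2
    return parts


def local_pair_counts(baskets):
    """Local step: count co-occurring item pairs within one partition."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return counts


def global_merge(local_results):
    """Global step: sum the per-worker counts (the role a reduction would play)."""
    total = Counter()
    for counts in local_results:
        total.update(counts)
    return total


# Hypothetical toy data: four market baskets split across two workers.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
parts = partition(transactions, n_workers=2)
merged = global_merge(local_pair_counts(p) for p in parts)
```

Because the local counts are simply summed, the merged result is the same as counting over the unpartitioned data; only the per-worker counters, not the raw transactions, need to be exchanged, which is one way a design like this keeps communication small.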
