Parallel Top-K Similarity Join Algorithms Using MapReduce

2012 IEEE 28th International Conference on Data Engineering, Pub Date: 2012-04-01, DOI: 10.1109/ICDE.2012.87

Younghoon Kim, Kyuseok Shim

Citations: 87

Abstract

Many applications need to find the top-k most similar pairs of records in a given database. Computing such top-k similarity joins is challenging, however, because a growing number of applications must handle vast amounts of data. For such data-intensive applications, parallel execution of programs on large clusters of commodity machines using the MapReduce paradigm has recently received a lot of attention. In this paper, we investigate how top-k similarity join algorithms can benefit from the popular MapReduce framework. We first develop divide-and-conquer and branch-and-bound algorithms. We next propose the all pair partitioning and essential pair partitioning methods to minimize the amount of data transferred between the map and reduce functions. Finally, we perform experiments with both synthetic and real-life data sets. Our performance study confirms the effectiveness and scalability of our MapReduce algorithms.
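
To make the problem statement concrete, the following is a minimal sketch of a top-k similarity join, not the paper's algorithms, whose details the abstract does not give. It assumes records are token sets compared by Jaccard similarity (the abstract names no similarity measure). The function names `jaccard`, `top_k_pairs`, and `top_k_pairs_partitioned` are hypothetical. The partitioned variant only imitates the map/reduce shape on a single machine: each "map" task scores one pair of partitions and emits its local top-k, and the "reduce" step merges the local lists.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed measure, not from the paper)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k_pairs(records, k):
    """Brute-force baseline: score every pair of records, keep the k best."""
    scored = (
        (jaccard(records[i], records[j]), i, j)
        for i, j in combinations(range(len(records)), 2)
    )
    return heapq.nlargest(k, scored)


def top_k_pairs_partitioned(records, k, num_parts):
    """Hypothetical map/reduce-style variant run sequentially.

    Record ids are split into num_parts partitions. Every pair of
    partitions (including a partition with itself) is one 'map' task
    that emits its local top-k; the final nlargest is the 'reduce' step.
    """
    ids = list(range(len(records)))
    parts = [ids[p::num_parts] for p in range(num_parts)]
    local_results = []
    for p in range(num_parts):
        for q in range(p, num_parts):
            if p == q:
                pairs = combinations(parts[p], 2)
            else:
                pairs = ((i, j) for i in parts[p] for j in parts[q])
            scored = (
                (jaccard(records[i], records[j]), min(i, j), max(i, j))
                for i, j in pairs
            )
            # Each task keeps only k candidates, which bounds what it emits.
            local_results.extend(heapq.nlargest(k, scored))
    return heapq.nlargest(k, local_results)


if __name__ == "__main__":
    records = [
        {"a", "b", "c"},
        {"a", "b", "d"},
        {"x", "y"},
        {"x", "y", "z"},
        {"a", "b", "c", "d"},
    ]
    for score, i, j in top_k_pairs_partitioned(records, k=2, num_parts=2):
        print(f"records {i} and {j}: similarity {score:.2f}")
```

Because every record pair falls into exactly one partition pair, any pair in the global top-k must also appear in its task's local top-k, so merging the local lists gives the same answer as the brute-force baseline. Even in this toy version each task emits at most k candidates, which hints at why limiting data transfer between map and reduce matters at scale.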

