How Improve Set Similarity Join Based on Prefix Approach in Distributed Environment

2018 International Conference on High Performance Computing & Simulation (HPCS) Pub Date : 2018-07-01 DOI:10.1109/HPCS.2018.00136

Song Zhu, Luca Gagliardelli, Giovanni Simonini, D. Beneventano



Abstract

Set similarity join is an essential operation for finding similar pairs of records in data integration and data analytics applications. To cope with the increasing scale of data, several techniques have been proposed to perform set similarity join on distributed frameworks (e.g., MapReduce). In particular, a MapReduce implementation of PPJoin, which was experimentally shown to be one of the best set similarity join algorithms, is publicly available. However, these techniques produce huge numbers of duplicates in order to process the data in parallel. Moreover, they do not guarantee load balancing, which leads to skew and limits their scalability. To address these problems, we propose a duplicate-free technique called TTJoin, which performs set similarity join efficiently by using an innovative filter derived from the prefix filter. We also implemented TTJoin on Apache Spark, one of the most innovative distributed frameworks. Several experiments on real-world datasets demonstrate the effectiveness of the proposed solution compared with the traditional TTJoin MapReduce implementation.
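For readers unfamiliar with the prefix filter that the abstract builds on, the following is a minimal single-machine sketch of generic prefix filtering for a Jaccard self-join. It is not TTJoin and not the PPJoin MapReduce implementation; the function names, the record layout, and the rarest-token-first ordering are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix_filter_join(records, threshold):
    """Self-join `records` (id -> iterable of string tokens) at a Jaccard threshold.

    Illustrative generic prefix filtering only; hypothetical helper, not TTJoin.
    """
    token_sets = {rid: set(tokens) for rid, tokens in records.items()}

    # Global token order (assumed here): rarest tokens first, ties alphabetical.
    freq = defaultdict(int)
    for tokens in token_sets.values():
        for tok in tokens:
            freq[tok] += 1

    index = defaultdict(list)  # prefix token -> ids of records already indexed
    candidates = set()         # a set, so a pair sharing several prefix tokens is kept once
    for rid, tokens in token_sets.items():
        ordered = sorted(tokens, key=lambda tok: (freq[tok], tok))
        # Prefix length for Jaccard threshold t: |x| - ceil(t * |x|) + 1.
        # The small epsilon guards against floating-point round-up in t * |x|.
        prefix_len = len(ordered) - math.ceil(threshold * len(ordered) - 1e-9) + 1
        for tok in ordered[:prefix_len]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)

    # Verification step: compute the exact similarity for surviving candidates.
    results = []
    for left, right in sorted(candidates):
        sim = jaccard(token_sets[left], token_sets[right])
        if sim >= threshold:
            results.append((left, right, sim))
    return results


example = {
    "r1": ["a", "b", "c", "d", "e"],
    "r2": ["a", "b", "c", "d", "f"],
    "r3": ["x", "y", "z"],
}
matches = prefix_filter_join(example, 0.6)
```

With a threshold of 0.6, only the pair (r1, r2) survives, since the two records share four of six distinct tokens. Here the candidate set removes repeated pairs locally. When each prefix token is instead used as a grouping key across workers, a pair that shares several prefix tokens can be generated once per shared token, which is one plausible source of the duplicates the abstract refers to.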

