MassJoin: A mapreduce-based method for scalable string similarity joins

2014 IEEE 30th International Conference on Data Engineering. Pub Date: 2014-05-19. DOI: 10.1109/ICDE.2014.6816663

Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, Jianhua Feng


Citations: 112

Abstract

String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions, and we use the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs, which reduces their number from cubic to linear complexity without sacrificing pruning power. To improve performance, we incorporate “light-weight” filter units into the key-value pairs; these units can prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art approaches.
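
The general idea of using partition-based signatures as MapReduce keys can be illustrated with a minimal sketch. It is not the MASSJOIN implementation: it covers only edit distance, runs map, shuffle, and reduce in a single process, and omits both the key-value merging and the light-weight filter units described above. All function names (segments, map_phase, similarity_join) are hypothetical. The sketch relies on the pigeonhole argument that if two strings are within edit distance tau, then when one is split into tau + 1 segments, at least one segment appears unchanged in the other, shifted by at most tau positions.

```python
from collections import defaultdict
from itertools import product


def segments(length, tau):
    """Split positions [0, length) into tau + 1 contiguous (start, size) segments."""
    k = tau + 1
    base, extra = divmod(length, k)
    out, start = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)  # last `extra` segments get +1
        out.append((start, size))
        start += size
    return out


def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def map_phase(sid, s, tau):
    """Emit (signature key, (side, string id)) pairs for one string."""
    # Index side: each segment of s is a signature, keyed with its
    # segment index and the length of s.
    for i, (start, size) in enumerate(segments(len(s), tau)):
        yield (s[start:start + size], i, len(s)), ("S", sid)
    # Probe side: for every length l that a similar string could have,
    # emit the substrings of s that could match segment i of such a string.
    for l in range(max(0, len(s) - tau), len(s) + tau + 1):
        for i, (start, size) in enumerate(segments(l, tau)):
            lo = max(0, start - tau)
            hi = min(len(s) - size, start + tau)
            seen = set()
            for p in range(lo, hi + 1):
                sub = s[p:p + size]
                if sub not in seen:
                    seen.add(sub)
                    yield (sub, i, l), ("R", sid)


def similarity_join(strings, tau):
    """Self-join: return sorted id pairs (a, b), a < b, with edit distance <= tau."""
    # "Shuffle": group the emitted values by key.
    groups = defaultdict(lambda: {"S": set(), "R": set()})
    for sid, s in enumerate(strings):
        for key, (side, value) in map_phase(sid, s, tau):
            groups[key][side].add(value)
    # "Reduce": strings sharing a key become candidate pairs.
    candidates = set()
    for bucket in groups.values():
        for a, b in product(bucket["S"], bucket["R"]):
            if a < b:
                candidates.add((a, b))
    # Verification: keep only the pairs that really are similar.
    return sorted(
        (a, b) for a, b in candidates
        if edit_distance(strings[a], strings[b]) <= tau
    )


if __name__ == "__main__":
    print(similarity_join(["vldb", "pvldb", "icde", "sigmod", "icdt"], 1))
```

In this naive form the probe side emits one key-value pair per (length, segment, position) combination, which is the kind of blow-up that the abstract's merging step is meant to reduce.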



