Distributed Streaming Set Similarity Join
Jianye Yang, W. Zhang, Xiang Wang, Ying Zhang, Xuemin Lin
2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 565-576, April 2020
DOI: 10.1109/ICDE48307.2020.00055
Citations: 10
Abstract
With the prevalence of Internet access and user-generated content, documents and records such as news articles and web pages are being produced continuously at an unprecedented rate. In this paper, we study the problem of efficient streaming set similarity join over distributed systems, which has broad applications in data cleaning and data integration tasks such as online near-duplicate detection. In contrast to the prefix-based distribution strategy widely adopted in offline distributed processing, we propose a simple yet efficient length-based distribution framework that dispatches incoming records by their length. A load-aware length partition method is developed to find a balanced partition by effectively estimating the local join cost. Perhaps surprisingly, our length-based scheme is superior to its competitors: it requires no replication, incurs little communication cost, and achieves high throughput. We further observe that the join results of the current incoming record can be used to guide index construction, which in turn facilitates the join processing of future records. Inspired by this observation, we propose a novel bundle-based join algorithm that groups similar records on the fly to reduce filtering cost. A by-product of this algorithm is an efficient verification technique that verifies a batch of records together, exploiting their token differences to share verification cost rather than verifying each record individually. Extensive experiments conducted on Storm, a popular distributed stream processing system, show that our methods can achieve up to an order of magnitude improvement in throughput over the baselines.
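To make the length-based dispatching concrete, below is a minimal sketch assuming Jaccard similarity with threshold TAU. It relies on the standard length filter: two sets r and s with Jaccard(r, s) >= TAU must satisfy TAU*|r| <= |s| <= |r|/TAU. The partition boundaries, and the split into a single "home" partition (where a record is indexed, hence no replication) versus the "probe" partitions it queries, are our illustrative reading of the abstract, not the paper's exact protocol.

```python
import math

TAU = 0.8                                        # similarity threshold (assumed)
BOUNDARIES = [0, 10, 20, 40, 80, float("inf")]   # hypothetical length partitions

def candidate_window(length: int, tau: float = TAU) -> tuple[int, int]:
    """Length range that any join partner of a record of this length falls in."""
    return math.ceil(tau * length), math.floor(length / tau)

def home_worker(length: int) -> int:
    """Single partition that indexes the record, chosen by its own length."""
    for w in range(len(BOUNDARIES) - 1):
        if BOUNDARIES[w] <= length < BOUNDARIES[w + 1]:
            return w
    raise ValueError(f"length {length} outside partition boundaries")

def probe_workers(length: int) -> list[int]:
    """Partitions whose length range overlaps the record's candidate window."""
    lo, hi = candidate_window(length)
    return [w for w in range(len(BOUNDARIES) - 1)
            if lo < BOUNDARIES[w + 1] and hi >= BOUNDARIES[w]]
```

For example, with TAU = 0.8 a record of length 25 has candidate window [20, 31], so it is indexed at partition 2 (covering [20, 40)) and probes only that partition, while a record of length 38 (window [31, 47]) probes partitions 2 and 3.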
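The batch verification idea can likewise be sketched. Rather than intersecting the probe record with every bundle member from scratch, one can compute the overlap with a representative once and patch it per member via stored token differences, using the identity |P ∩ M| = |P ∩ R| - |P ∩ (R \ M)| + |P ∩ (M \ R)|. The Bundle layout and function names below are illustrative assumptions, not the paper's data structures.

```python
from dataclasses import dataclass

@dataclass
class Bundle:
    rep: frozenset[str]                       # representative record's tokens
    diffs: dict[int, tuple[frozenset[str], frozenset[str]]]
    # member id -> (tokens removed from rep, tokens added to rep)
    sizes: dict[int, int]                     # member id -> |member|

def verify_bundle(probe: frozenset[str], bundle: Bundle, tau: float) -> list[int]:
    """Return ids of bundle members whose Jaccard with the probe reaches tau."""
    base = len(probe & bundle.rep)            # shared work: one full intersection
    results = []
    for mid, (removed, added) in bundle.diffs.items():
        # |P ∩ M| = |P ∩ R| - |P ∩ (R \ M)| + |P ∩ (M \ R)|
        inter = base - len(probe & removed) + len(probe & added)
        union = len(probe) + bundle.sizes[mid] - inter
        if union and inter / union >= tau:
            results.append(mid)
    return results
```

When bundle members differ from the representative by only a few tokens, each additional member costs only two small intersections instead of a full pass over its token set, which is one plausible way a batch can share verification cost as the abstract describes.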