Stream-aware indexing for distributed inequality join processing

IF 3.4, Region 2 (Computer Science), JCR Q2: COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems Pub Date : 2024-07-06 DOI:10.1016/j.is.2024.102425

Adeel Aslam, Giovanni Simonini, Luca Gagliardelli, Luca Zecchini, Sonia Bergamaschi

Information Systems, Volume 125, Article 102425. Full text: https://www.sciencedirect.com/science/article/pii/S0306437924000838

Citations: 0

Abstract

Inequality join is an operator to join data on inequality conditions and it is a fundamental building block for applications. While methods and optimizations exist for efficient inequality join in batch processing, little attention has been given to its streaming version, particularly to large-scale data-intensive applications that run on Distributed Stream Processing Systems (DSPSs). Designing an inequality join in streaming and distributed settings is not an easy task: (i) indexes have to be employed to efficiently support inequality-based comparisons, but the continuous stream of data imposes continuous insertions, updates, and deletions of elements in the indexes—hence a huge overhead for the DSPSs; (ii) oftentimes real data is skewed, which makes indexing even more challenging.
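To make point (i) concrete, the following is a minimal single-node sketch of an index-based streaming inequality join for the predicate `r < s`. It is not taken from the paper: the class names `SortedIndex` and `InequalityJoin` are hypothetical, and a sorted Python list stands in for whatever tree index a real system would use. Every arriving tuple probes the opposite index and is then inserted into its own, which is where the continuous update cost comes from.

```python
import bisect


class SortedIndex:
    """Sorted in-memory list of join keys (a stand-in for a tree index)."""

    def __init__(self):
        self._keys = []

    def insert(self, key):
        # Keeps the list sorted; every arriving tuple pays this cost.
        bisect.insort(self._keys, key)

    def delete(self, key):
        # Removes one occurrence of key, e.g. when a tuple leaves the window.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._keys.pop(i)

    def greater_than(self, key):
        return self._keys[bisect.bisect_right(self._keys, key):]

    def less_than(self, key):
        return self._keys[:bisect.bisect_left(self._keys, key)]


class InequalityJoin:
    """Symmetric streaming join that emits pairs (r, s) with r < s."""

    def __init__(self):
        self.r_index = SortedIndex()
        self.s_index = SortedIndex()

    def on_r(self, r):
        # Probe the S index first, then store r for future S arrivals.
        matches = [(r, s) for s in self.s_index.greater_than(r)]
        self.r_index.insert(r)
        return matches

    def on_s(self, s):
        # Probe the R index first, then store s for future R arrivals.
        matches = [(r, s) for r in self.r_index.less_than(s)]
        self.s_index.insert(s)
        return matches
```

With skewed input, the same key is inserted into and deleted from the index again and again, which is the redundancy the abstract points to in (ii).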

To address these challenges, we propose the Stream-Aware inequality join (STA), an indexing method that can reduce redundancy and index update overhead. STA builds a separate in-memory index structure for hotkeys, i.e., the most frequently used keys, which are automatically identified with an efficient data sketch. On the other hand, the cold keys are treated using a linked set of index structures. In this way, STA avoids many superfluous index updates for frequent items. Finally, we implement four state-of-the-art inequality join solutions for a widely employed DSPS (Apache Storm) and compare their performance with STA on four real-world data sets and a synthetic one. The results of our experimental evaluation reveal that our stream-aware approach outperforms existing solutions.
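The hot/cold split described above can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not name the sketch or the index structures, so the Count-Min sketch, the dictionary of per-key counters for hot keys, the chain of small sorted lists for cold keys, and all names and parameters (`HotColdIndex`, `hot_threshold`, `chunk_size`) are assumptions made purely for illustration.

```python
import bisect
from collections import deque


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, key):
        """Record one occurrence and return the updated frequency estimate."""
        slots = [hash((row, key)) % self.width for row in range(self.depth)]
        for row, slot in enumerate(slots):
            self.rows[row][slot] += 1
        return min(self.rows[row][slot] for row, slot in enumerate(slots))


class HotColdIndex:
    """Hot keys get one counter each; cold keys go to a chain of sorted lists."""

    def __init__(self, hot_threshold=3, chunk_size=4):
        self.sketch = CountMinSketch()
        self.hot_threshold = hot_threshold  # assumed promotion threshold
        self.chunk_size = chunk_size        # assumed capacity of one cold chunk
        self.hot = {}                       # key -> occurrences since promotion
        self.cold = deque([[]])             # linked chain of small sorted lists

    def insert(self, key):
        estimate = self.sketch.add(key)
        if key in self.hot:
            self.hot[key] += 1              # counter bump, no index restructuring
        elif estimate >= self.hot_threshold:
            self.hot[key] = 1               # promote; earlier copies stay cold
        else:
            if len(self.cold[-1]) >= self.chunk_size:
                self.cold.append([])        # link a fresh chunk to the chain
            bisect.insort(self.cold[-1], key)

    def count_less_than(self, probe):
        """Number of stored tuples whose key is strictly below probe."""
        hot = sum(n for key, n in self.hot.items() if key < probe)
        cold = sum(bisect.bisect_left(chunk, probe) for chunk in self.cold)
        return hot + cold

    def expire_oldest_chunk(self):
        """Drop the oldest cold chunk in one step instead of key by key."""
        if len(self.cold) > 1:
            self.cold.popleft()
```

In this sketch a repeated hot key costs a single counter increment rather than a sorted insertion, which mirrors the stated goal of avoiding superfluous index updates for frequent items. Expiry of hot-key counters is left out for brevity.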


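Source journal

Information Systems (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 9.40

Self-citation rate: 2.70%

Articles published: 112

Review time: 53 days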

Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems, such as new data models and performance enhancements, and show how those innovations contribute to the goals of the application.