Stream-aware indexing for distributed inequality join processing

IF 3 2区 计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS
Adeel Aslam, Giovanni Simonini, Luca Gagliardelli, Luca Zecchini, Sonia Bergamaschi
{"title":"Stream-aware indexing for distributed inequality join processing","authors":"Adeel Aslam,&nbsp;Giovanni Simonini,&nbsp;Luca Gagliardelli,&nbsp;Luca Zecchini,&nbsp;Sonia Bergamaschi","doi":"10.1016/j.is.2024.102425","DOIUrl":null,"url":null,"abstract":"<div><p>Inequality join is an operator to join data on inequality conditions and it is a fundamental building block for applications. While methods and optimizations exist for efficient inequality join in batch processing, little attention has been given to its streaming version, particularly to large-scale data-intensive applications that run on <em>Distributed Stream Processing Systems</em> (DSPSs). Designing an inequality join in streaming and distributed settings is not an easy task: <em>(i)</em> indexes have to be employed to efficiently support inequality-based comparisons, but the continuous stream of data imposes continuous insertions, updates, and deletions of elements in the indexes—hence a huge overhead for the DSPSs; <em>(ii)</em> oftentimes real data is skewed, which makes indexing even more challenging.</p><p>To address these challenges, we propose the <em>Stream-Aware inequality join</em> (STA), an indexing method that can reduce redundancy and index update overhead. STA builds a separate in-memory index structure for hotkeys, i.e., the most frequently used keys, which are automatically identified with an efficient data sketch. On the other hand, the cold keys are treated using a linked set of index structures. In this way, STA avoids many superfluous index updates for frequent items. Finally, we implement four state-of-the-art inequality join solutions for a widely employed DSPS (Apache Storm) and compare their performance with STA on four real-world data sets and a synthetic one. The results of our experimental evaluation reveal that our stream-aware approach outperforms existing solutions.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":null,"pages":null},"PeriodicalIF":3.0000,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306437924000838","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Inequality join is an operator to join data on inequality conditions and it is a fundamental building block for applications. While methods and optimizations exist for efficient inequality join in batch processing, little attention has been given to its streaming version, particularly to large-scale data-intensive applications that run on Distributed Stream Processing Systems (DSPSs). Designing an inequality join in streaming and distributed settings is not an easy task: (i) indexes have to be employed to efficiently support inequality-based comparisons, but the continuous stream of data imposes continuous insertions, updates, and deletions of elements in the indexes—hence a huge overhead for the DSPSs; (ii) oftentimes real data is skewed, which makes indexing even more challenging.

To address these challenges, we propose the Stream-Aware inequality join (STA), an indexing method that can reduce redundancy and index update overhead. STA builds a separate in-memory index structure for hotkeys, i.e., the most frequently used keys, which are automatically identified with an efficient data sketch. On the other hand, the cold keys are treated using a linked set of index structures. In this way, STA avoids many superfluous index updates for frequent items. Finally, we implement four state-of-the-art inequality join solutions for a widely employed DSPS (Apache Storm) and compare their performance with STA on four real-world data sets and a synthetic one. The results of our experimental evaluation reveal that our stream-aware approach outperforms existing solutions.

分布式不等式连接处理的流感知索引
不等式连接是一种根据不等式条件连接数据的操作符,是应用程序的基本构件。虽然在批处理中存在高效不等式连接的方法和优化措施,但很少有人关注其流式版本,尤其是在分布式流处理系统(DSPS)上运行的大规模数据密集型应用。在流式和分布式环境中设计不等式连接并非易事:(i) 必须使用索引来有效支持基于不等式的比较,但连续的数据流会不断插入、更新和删除索引中的元素,因此会给 DSPS 带来巨大的开销;(ii) 有时真实数据是倾斜的,这使得索引编制更具挑战性。STA 为热键(即最常用的键,可通过高效的数据草图自动识别)建立单独的内存索引结构。另一方面,冷键则使用一组链接的索引结构来处理。通过这种方式,STA 可以避免对频繁项目进行许多多余的索引更新。最后,我们为广泛使用的 DSPS(Apache Storm)实施了四种最先进的不等式连接解决方案,并在四个真实数据集和一个合成数据集上比较了它们与 STA 的性能。实验评估结果表明,我们的流感知方法优于现有解决方案。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Information Systems
Information Systems 工程技术-计算机:信息系统
CiteScore
9.40
自引率
2.70%
发文量
112
审稿时长
53 days
期刊介绍: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信