Scaling Out Schema-free Stream Joins

Damjan Gjurovski, S. Michel
2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 805-816. Published 2020-04-01. DOI: 10.1109/ICDE48307.2020.00075 (https://doi.org/10.1109/ICDE48307.2020.00075)

Abstract

In this work, we consider computing natural joins over massive streams of JSON documents that do not adhere to a specific schema. We first propose an efficient and scalable partitioning algorithm that uses the main principles of association analysis to identify patterns of co-occurrence of the attribute-value pairs within the documents. Data is then accordingly forwarded to compute nodes and locally joined using a novel FP-tree–based join algorithm. By compactly storing the documents and efficiently traversing the FP-tree structure, the proposed join algorithm can operate on large input sizes and provide results in real-time. We discuss data-dependent scalability limitations that are inherent to natural joins over schema-free data and show how to practically circumvent them by artificially expanding the space of possible attribute-value pairs. The proposed algorithms are realized in the Apache Storm stream processing framework. Through extensive experiments with real-world as well as synthetic data, we evaluate the proposed algorithms and show that they outperform competing approaches.
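
To make the join semantics concrete, the following is a minimal sketch of a natural join over schema-free documents. It is not the paper's partitioning or FP-tree algorithm. It assumes flat JSON documents and assumes that two documents join when they share at least one attribute and agree on the value of every shared attribute. The names `joinable` and `NaiveStreamJoiner` are hypothetical and introduced only for illustration. The joiner compares each arriving document with every stored one, which is the quadratic baseline that a partitioned, index-based approach aims to avoid.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]  # a flat JSON document: attribute -> value (assumption)


def joinable(a: Doc, b: Doc) -> bool:
    """Assumed join predicate: the documents share at least one attribute
    and hold equal values for every attribute they share."""
    shared = a.keys() & b.keys()
    return bool(shared) and all(a[k] == b[k] for k in shared)


class NaiveStreamJoiner:
    """Hypothetical baseline: joins each arriving document against all
    previously seen documents by exhaustive comparison."""

    def __init__(self) -> None:
        self.seen: List[Doc] = []

    def process(self, doc: Doc) -> List[Doc]:
        # Merge the new document with every stored document it can join with.
        results = [{**old, **doc} for old in self.seen if joinable(old, doc)]
        # Store the document so that later arrivals can join with it.
        self.seen.append(doc)
        return results


if __name__ == "__main__":
    joiner = NaiveStreamJoiner()
    stream = [
        {"user": "a", "city": "x"},
        {"user": "a", "age": 3},
        {"user": "a", "city": "y"},  # conflicts with the first document on "city"
    ]
    for document in stream:
        print(joiner.process(document))
```

Because the documents follow no fixed schema, the set of shared attributes differs for every pair, so the join attributes cannot be fixed in advance. This is one reason a plain hash partitioning on a single join key does not carry over directly to this setting.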