Scaling Out Schema-free Stream Joins

2020 IEEE 36th International Conference on Data Engineering (ICDE). Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00075

Damjan Gjurovski, S. Michel


Abstract

In this work, we consider computing natural joins over massive streams of JSON documents that do not adhere to a specific schema. We first propose an efficient and scalable partitioning algorithm that uses the main principles of association analysis to identify patterns of co-occurrence of the attribute-value pairs within the documents. Documents are then forwarded accordingly to compute nodes and joined locally using a novel FP-tree-based join algorithm. By compactly storing the documents and efficiently traversing the FP-tree structure, the proposed join algorithm can operate on large inputs and deliver results in real time. We discuss data-dependent scalability limitations that are inherent to natural joins over schema-free data and show how to circumvent them in practice by artificially expanding the space of possible attribute-value pairs. The proposed algorithms are implemented in the Apache Storm stream processing framework. Through extensive experiments with real-world as well as synthetic data, we evaluate the proposed algorithms and show that they outperform competing approaches.
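To make the join semantics concrete, the following is a minimal Python sketch of a natural join over a stream of schema-free documents. It is not the paper's FP-tree-based algorithm: it swaps in a plain inverted index over attribute-value pairs, and it assumes flat documents with hashable scalar values. It also assumes that two documents join when they share at least one attribute and agree on every attribute they share. The names `joinable` and `NaiveStreamJoiner` are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]  # assumption: flat documents with hashable scalar values


def joinable(a: Doc, b: Doc) -> bool:
    """Assumed semantics: at least one shared attribute, no conflicting values."""
    shared = a.keys() & b.keys()
    return bool(shared) and all(a[k] == b[k] for k in shared)


class NaiveStreamJoiner:
    """Illustrative stand-in for a local join operator (not the FP-tree join)."""

    def __init__(self) -> None:
        self.docs: List[Doc] = []
        # Inverted index: (attribute, value) -> ids of stored documents containing it.
        self.index: Dict[Tuple[str, Any], List[int]] = {}

    def insert(self, doc: Doc) -> List[Doc]:
        """Join the arriving document with all stored ones, then store it."""
        # Any join partner must share at least one attribute-value pair,
        # so the index lookup yields a complete candidate set.
        candidates = set()
        for pair in doc.items():
            candidates.update(self.index.get(pair, ()))

        # Filter out candidates that conflict on some other shared attribute.
        results = [
            {**self.docs[i], **doc}
            for i in sorted(candidates)
            if joinable(self.docs[i], doc)
        ]

        # Store the new document so later arrivals can join with it.
        doc_id = len(self.docs)
        self.docs.append(doc)
        for pair in doc.items():
            self.index.setdefault(pair, []).append(doc_id)
        return results


joiner = NaiveStreamJoiner()
first = joiner.insert({"a": 1, "b": 2})    # nothing stored yet
second = joiner.insert({"b": 2, "c": 3})   # shares b=2 with the first document
third = joiner.insert({"a": 1, "b": 9})    # conflicts on b with both stored documents
```

Under these assumptions, the same attribute-value pairs that key the index could also serve as routing keys when documents are distributed across compute nodes, which is where co-occurrence patterns between pairs become relevant for partitioning.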


