A Holistic Stream Partitioning Algorithm for Distributed Stream Processing Systems

2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT). Pub Date: 2019-12-01. DOI: 10.1109/PDCAT46702.2019.00046

Kejian Li, Gang Liu, Minhua Lu


Citation count: 1

Abstract

The performance of modern distributed stream processing systems depends critically on how load is distributed across workers. Skewed data streams are very common in the real world and pose a great challenge to these systems, especially for stateful applications. Key splitting, which allows a single key to be routed to multiple workers, is an effective way to balance load in the cluster. However, it comes at the cost of increased memory consumption, computation overhead, and network communication. In this paper, we present a new metric that models the cost of key splitting for intra-operator parallelism in stream processing systems, and we provide a novel perspective on reducing the replication factor while keeping both overall load imbalance and processing latency low. Similar to previous work, our approach treats the head and the tail of the distribution differently in order to reduce memory requirements. For the head, it uses our proposed notion of regional load imbalance to decide dynamically whether to make one more worker responsible for a heavy hitter. For the tail, it simply uses hash partitioning, which keeps the routing table for the head as small as possible. Extensive experimental evaluation demonstrates that our approach outperforms state-of-the-art partitioning algorithms in terms of load imbalance, replication factor, and latency across different levels of stream skew.
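The head/tail idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not give the definition of regional load imbalance, so the formula below (the relative gap between the most loaded worker serving a key and the mean load of all workers) is an assumption made only for illustration. The class name `HeadTailPartitioner`, the thresholds, the warm-up period, and the use of exact counts for heavy-hitter detection are likewise hypothetical.

```python
import zlib
from collections import Counter


class HeadTailPartitioner:
    """Illustrative head/tail router. All names and thresholds are hypothetical."""

    def __init__(self, num_workers, head_threshold=0.05,
                 imbalance_limit=0.1, warmup=100):
        self.num_workers = num_workers
        self.head_threshold = head_threshold    # min relative frequency of a head key
        self.imbalance_limit = imbalance_limit  # tolerated regional imbalance
        self.warmup = warmup                    # tuples to observe before trusting frequencies
        self.loads = [0] * num_workers          # tuples sent to each worker so far
        self.counts = Counter()                 # exact counts; a real system would likely use a sketch
        self.total = 0
        self.routing_table = {}                 # head key -> list of responsible workers

    def _hash(self, key):
        # Deterministic hash, so a tail key always maps to the same worker.
        return zlib.crc32(str(key).encode("utf-8")) % self.num_workers

    def _regional_imbalance(self, workers):
        # ASSUMPTION: relative gap between the most loaded worker serving this
        # key and the mean load of all workers. The paper's definition may differ.
        mean = sum(self.loads) / self.num_workers
        if mean == 0:
            return 0.0
        return (max(self.loads[w] for w in workers) - mean) / mean

    def _is_head(self, key):
        if key in self.routing_table:
            return True
        return (self.total >= self.warmup
                and self.counts[key] / self.total >= self.head_threshold)

    def route(self, key):
        self.total += 1
        self.counts[key] += 1
        if self._is_head(key):
            workers = self.routing_table.setdefault(key, [self._hash(key)])
            if (len(workers) < self.num_workers
                    and self._regional_imbalance(workers) > self.imbalance_limit):
                # Make one more worker responsible: the least loaded one
                # that is not yet serving this key.
                spare = [w for w in range(self.num_workers) if w not in workers]
                workers.append(min(spare, key=lambda w: self.loads[w]))
            target = min(workers, key=lambda w: self.loads[w])
        else:
            # Tail keys need no routing-table entry: plain hash partitioning.
            target = self._hash(key)
        self.loads[target] += 1
        return target


if __name__ == "__main__":
    partitioner = HeadTailPartitioner(num_workers=4)
    # Synthetic skewed stream: one key carries half of all tuples.
    stream = ["hot" if i % 2 == 0 else f"k{i}" for i in range(2000)]
    for key in stream:
        partitioner.route(key)
    print("loads:", partitioner.loads)
    print("routing table:", partitioner.routing_table)
```

In this sketch only keys that cross the frequency threshold ever get a routing-table entry, so the table stays small, and a heavy key gains an extra worker only when the imbalance check fires, which is one way to keep the replication factor from growing unnecessarily.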

