A Holistic Stream Partitioning Algorithm for Distributed Stream Processing Systems

Kejian Li, Gang Liu, Minhua Lu
{"title":"A Holistic Stream Partitioning Algorithm for Distributed Stream Processing Systems","authors":"Kejian Li, Gang Liu, Minhua Lu","doi":"10.1109/PDCAT46702.2019.00046","DOIUrl":null,"url":null,"abstract":"The performances of modern distributed stream processing systems are critically affected by the distribution of the load across workers. Skewed data streams in real world are very common and pose a great challenge to these systems, especially for stateful applications. Key splitting, which allows a single key to be routed to multiple workers, is a great idea to achieve good balance of load in the cluster. However, it comes with the cost of increased memory consumption and computation overhead as well as network communication. In this paper, we present a new definition of metric to model the cost of key splitting for intra-operator parallelism in stream processing systems and provide a novel perspective to reduce replication factor while keeping both overall load imbalance and processing latency low. Similar to previous work, our approach treats the head and the tail of the distribution differently in order to reduce memory requirements. For the head, it uses our proposed notion of regional load imbalance to decide dynamically whether to make one more worker responsible for the heavy hitter or not. For the tail, it simply uses hash partitioning to keep the size of the routing table for the head as small as possible. Extensive experimental evaluation demonstrates that our approach provides superior performance compared to the state-of-the-art partitioning algorithms in terms of load imbalance, replication factor and latency over different levels of skewed stream distributions.","PeriodicalId":166126,"journal":{"name":"2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDCAT46702.2019.00046","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The performances of modern distributed stream processing systems are critically affected by the distribution of the load across workers. Skewed data streams in real world are very common and pose a great challenge to these systems, especially for stateful applications. Key splitting, which allows a single key to be routed to multiple workers, is a great idea to achieve good balance of load in the cluster. However, it comes with the cost of increased memory consumption and computation overhead as well as network communication. In this paper, we present a new definition of metric to model the cost of key splitting for intra-operator parallelism in stream processing systems and provide a novel perspective to reduce replication factor while keeping both overall load imbalance and processing latency low. Similar to previous work, our approach treats the head and the tail of the distribution differently in order to reduce memory requirements. For the head, it uses our proposed notion of regional load imbalance to decide dynamically whether to make one more worker responsible for the heavy hitter or not. For the tail, it simply uses hash partitioning to keep the size of the routing table for the head as small as possible. Extensive experimental evaluation demonstrates that our approach provides superior performance compared to the state-of-the-art partitioning algorithms in terms of load imbalance, replication factor and latency over different levels of skewed stream distributions.
分布式流处理系统的整体流划分算法
现代分布式流处理系统的性能受到工作负载分布的严重影响。在现实世界中,倾斜的数据流非常常见,并且对这些系统构成了巨大的挑战,特别是对于有状态的应用程序。密钥分割允许将单个密钥路由到多个工作者,这是在集群中实现良好负载平衡的好主意。然而,它带来的代价是内存消耗和计算开销以及网络通信的增加。在本文中,我们提出了一种新的度量定义来模拟流处理系统中操作符内并行性的键分割成本,并提供了一种新的视角来减少复制因子,同时保持总体负载不平衡和处理延迟低。与之前的工作类似,我们的方法对分布的头部和尾部进行了不同的处理,以减少内存需求。对于头部,它使用我们提出的区域负载不平衡的概念来动态决定是否让更多的工人负责重磅打击。对于尾部,它只是使用散列分区来保持头部路由表的大小尽可能小。广泛的实验评估表明,与最先进的分区算法相比,我们的方法在负载不平衡、复制因子和不同级别的倾斜流分布的延迟方面提供了优越的性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信