A Holistic Stream Partitioning Algorithm for Distributed Stream Processing Systems
Kejian Li, Gang Liu, Minhua Lu
2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), published 2019-12-01
DOI: https://doi.org/10.1109/PDCAT46702.2019.00046
Citations: 1
Abstract
The performance of modern distributed stream processing systems depends critically on how load is distributed across workers. Skewed data streams are very common in the real world and pose a great challenge to these systems, especially for stateful applications. Key splitting, which allows a single key to be routed to multiple workers, is an effective way to balance load across the cluster. However, it comes at the cost of increased memory consumption, computation overhead, and network communication. In this paper, we present a new metric that models the cost of key splitting for intra-operator parallelism in stream processing systems, and we provide a novel perspective on reducing the replication factor while keeping both overall load imbalance and processing latency low. Similar to previous work, our approach treats the head and the tail of the distribution differently in order to reduce memory requirements. For the head, it uses our proposed notion of regional load imbalance to decide dynamically whether to make one more worker responsible for a heavy hitter. For the tail, it simply uses hash partitioning, which keeps the routing table for the head as small as possible. Extensive experimental evaluation demonstrates that our approach outperforms state-of-the-art partitioning algorithms in terms of load imbalance, replication factor, and latency across different levels of stream skew.
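The head/tail idea described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not define regional load imbalance, so the `_regional_imbalance` formula below, the `threshold` parameter, the class name `HeadTailPartitioner`, and the assumption that head keys are known in advance are all hypothetical choices made for illustration. The `load_imbalance` helper uses the common "maximum load minus mean load" definition, which may differ from the metric used in the paper.

```python
import zlib


def load_imbalance(load):
    """Overall imbalance as max load minus mean load (a common definition)."""
    return max(load) - sum(load) / len(load)


class HeadTailPartitioner:
    """Illustrative head/tail partitioner with key splitting for head keys."""

    def __init__(self, num_workers, head_keys, threshold=0.2):
        self.num_workers = num_workers
        self.head_keys = set(head_keys)  # ASSUMPTION: heavy hitters known up front
        self.threshold = threshold       # hypothetical tuning knob
        self.load = [0] * num_workers    # tuples routed to each worker so far
        self.routing = {}                # head key -> list of workers holding it

    def _hash(self, key):
        # Deterministic hash so a tail key always maps to the same worker.
        return zlib.crc32(str(key).encode()) % self.num_workers

    def _regional_imbalance(self, replicas):
        # ASSUMPTION: compare the busiest worker among this key's replicas
        # with the mean load of the whole cluster, relative to that mean.
        mean = sum(self.load) / self.num_workers
        if mean == 0:
            return 0.0
        return (max(self.load[w] for w in replicas) - mean) / mean

    def route(self, key):
        if key not in self.head_keys:
            # Tail: plain hash partitioning, no routing-table entry needed.
            worker = self._hash(key)
        else:
            replicas = self.routing.setdefault(key, [self._hash(key)])
            # Head: add one more worker only if the key's region is overloaded.
            if (len(replicas) < self.num_workers
                    and self._regional_imbalance(replicas) > self.threshold):
                outside = [w for w in range(self.num_workers) if w not in replicas]
                replicas.append(min(outside, key=lambda w: self.load[w]))
            # Send the tuple to the least-loaded worker that holds this key.
            worker = min(replicas, key=lambda w: self.load[w])
        self.load[worker] += 1
        return worker
```

In this sketch the routing table only ever contains head keys, which mirrors the memory argument in the abstract: tail keys cost no table entries and are never replicated. The number of workers listed for a head key in `routing` is that key's replication factor, so a larger `threshold` trades higher imbalance for fewer replicas.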