Modeling Distributed Stream Processing Systems Under Heavy Workload

2019 International Conference on Cyberworlds (CW) Pub Date : 2019-10-01 DOI:10.1109/CW.2019.00024

Muhammad Mudassar Qureshi, Hanhua Chen, Hai Jin



Abstract

Big data applications play a significant role in diverse fields. Distributed Stream Processing Engines (DSPEs) are widely used to support real-time applications efficiently. Partitioning algorithms split data streams across multiple nodes so that they can be processed in parallel for better performance. Aggregation cost is an important factor when processing stateful streaming applications with such partitioning algorithms, because it affects performance when the final result is produced. However, the impact of aggregation cost on stream processing has not been discussed comprehensively in the existing literature. We use performance modeling to identify the importance of aggregation cost when the workload is high. We implement the performance model on a multi-node cluster to predict the same behavior as the single-resource performance model. Experimental results show that, when the workload is high, a stateful streaming application needs more resources than a stateless streaming application running on the same DSPE. Further experimental results show that performance modeling may help predict the maximum workload that can be processed on a DSPE, and that increasing the parallelism level is not guaranteed to improve the performance of streaming applications.
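The claim that more parallelism is not guaranteed to help can be illustrated with a toy cost model. The sketch below is not the performance model from the paper, which the abstract does not specify. It assumes, purely for illustration, that processing time for a batch shrinks as 1/p with parallelism p, while a stateful application pays an extra aggregation cost that grows linearly with p (one partial result to merge per parallel instance). The function names and parameters (`batch_time`, `throughput`, `best_parallelism`, `t_proc`, `t_agg`) are all hypothetical.

```python
def batch_time(n_tuples, t_proc, parallelism, t_agg=0.0):
    """Seconds to finish one batch under the assumed toy model.

    n_tuples:    tuples in the batch
    t_proc:      assumed processing time per tuple (seconds)
    parallelism: number of parallel operator instances (p >= 1)
    t_agg:       assumed cost of merging one instance's partial result
                 (seconds); 0.0 models a stateless application
    """
    processing = n_tuples * t_proc / parallelism  # work is split across instances
    aggregation = t_agg * parallelism             # one partial result per instance
    return processing + aggregation


def throughput(n_tuples, t_proc, parallelism, t_agg=0.0):
    """Tuples per second for one batch under the toy model."""
    return n_tuples / batch_time(n_tuples, t_proc, parallelism, t_agg)


def best_parallelism(n_tuples, t_proc, t_agg, max_parallelism):
    """Parallelism level in 1..max_parallelism with the highest throughput."""
    return max(
        range(1, max_parallelism + 1),
        key=lambda p: throughput(n_tuples, t_proc, p, t_agg),
    )


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    n, t_proc, t_agg = 10_000, 0.001, 0.1
    for p in (1, 5, 10, 20, 32):
        stateless = throughput(n, t_proc, p)
        stateful = throughput(n, t_proc, p, t_agg)
        print(f"p={p:2d}  stateless={stateless:9.1f}  stateful={stateful:9.1f}")
    print("best stateful parallelism:", best_parallelism(n, t_proc, t_agg, 32))
```

Under these assumptions the stateless throughput keeps rising with p, while the stateful throughput peaks and then falls once the aggregation term dominates. The same curve also gives a rough upper bound on the workload the stateful application can sustain at a given parallelism level.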

