ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks

ArXiv Pub Date: 2024-03-07 DOI: 10.1145/3629526.3645036

Sören Henning, Adriano Vogel, Michael Leichtfried, Otmar Ertl, Rick Rabiser


Citations: 0

Abstract

Distributed stream processing frameworks help build scalable and reliable applications that perform transformations and aggregations on continuous data streams. This paper introduces ShuffleBench, a novel benchmark for evaluating the performance of modern stream processing frameworks. In contrast to other benchmarks, it focuses on use cases where stream processing frameworks are mainly employed for shuffling (i.e., re-distributing) data records to perform state-local aggregations, while the actual aggregation logic is treated as a black-box software component. ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform and adopts benchmarking metrics and methods for latency, throughput, and scalability established in the performance engineering research community. Although inspired by a real-world observability use case, it is highly configurable to allow domain-independent evaluations. ShuffleBench comes as ready-to-use open-source software that uses existing Kubernetes tooling and provides implementations for four state-of-the-art frameworks. Therefore, we expect ShuffleBench to be a valuable contribution to both industrial practitioners building stream processing applications and researchers working on new stream processing approaches. We complement this paper with an experimental performance evaluation that employs ShuffleBench with various configurations on Flink, Hazelcast, Kafka Streams, and Spark in a cloud-native environment. Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
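The pattern the abstract describes, shuffling records by key so that a black-box aggregator can work on partition-local state, can be illustrated with a minimal single-process sketch. This is not ShuffleBench's implementation and uses none of the four frameworks' APIs; the record shape, the function names (`partition_for`, `shuffle`, `aggregate_locally`, `running_sum`), and the CRC32-based partitioner are all assumptions made for illustration.

```python
import zlib
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, int]  # hypothetical record shape: (key, value)
Aggregator = Callable[[Optional[int], int], int]  # (current state, value) -> new state


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (CRC32 is an assumed choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(records: Iterable[Record], num_partitions: int) -> List[List[Record]]:
    """Re-distribute records so that all records with the same key share a partition."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def aggregate_locally(partition: List[Record], aggregator: Aggregator) -> Dict[str, int]:
    """Apply a black-box aggregator using only state that lives in this partition."""
    state: Dict[str, int] = {}
    for key, value in partition:
        state[key] = aggregator(state.get(key), value)
    return state


def running_sum(current: Optional[int], value: int) -> int:
    """Example aggregator; the shuffle logic never inspects what it computes."""
    return value if current is None else current + value


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = shuffle(records, num_partitions=3)

results: Dict[str, int] = {}
for partition in partitions:
    # Keys never span partitions, so merging the per-partition states is conflict-free.
    results.update(aggregate_locally(partition, running_sum))
```

Because every key maps to exactly one partition, each partition's state can be updated without coordinating with the others, which is what makes the aggregation state-local. In a real framework the partitions would live on different workers and the shuffle would move records over the network.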




Source journal

ArXiv
