Benchmarking Synchronous and Asynchronous Stream Processing Systems

Proceedings of the 7th ACM IKDD CoDS and 25th COMAD. Pub Date: 2020-01-05. DOI: 10.1145/3371158.3371206

V. E. Venugopal, M. Theobald


Citations: 4

Abstract



With recent advances in Big Data and Internet-of-Things (IoT) applications, we observe continued growth in the streaming data produced by sensor and social networks, broadcasting systems, e-commerce, and many others. Even though Big Data platforms such as Apache Hadoop [12], Spark [13], Storm [11] and Kafka [9] would serve the purpose, their underlying batch mode of operation makes it necessary to first split the incoming data streams into batches, and then to synchronously execute a given analytical workflow over these batches. To overcome the limitations of these synchronous stream-processing architectures, asynchronous stream-processing (ASP) engines such as Apache Flink [1], Samza [10] and Naiad [7, 8] have recently emerged. Although the asynchronous handling of streams is reported to be the prime reason for the performance gains of ASP engines (in terms of sustainable throughput [4, 6] and per-window latencies), we believe that their architectural similarity to the original design of Hadoop has not yet been investigated critically enough. Given the inherent deviations of distributed computations (due to communication and network delays, scheduling algorithms, time spent on processing, serialization/deserialization, etc.), the performance of platforms built on a master-client architecture is still often bound by hidden synchronization barriers and the constant need for state exchange (and hence communication) between the master and the client nodes. To understand the upper bound of the maximum sustainable throughput [5] that is possible for a given node configuration, we have designed multiple hard-coded, multi-threaded processes (called ad-hoc dataflows) in C++ using the Message Passing Interface (MPI) and Pthread libraries for two use cases, namely the Yahoo! Streaming Benchmark (YSB) [2] and Simple Windowed Aggregation (SWA), such that they can collectively process an input stream based on the logic of the use case. Once deployed, these dataflows communicate with each other asynchronously to perform the use-case-specific operations with 100% accuracy. The performance of these lightweight ad-hoc dataflows is compared against the main competitors among the stream data processing
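To make the Simple Windowed Aggregation use case more concrete, the following is a minimal single-process sketch of a tumbling-window sum. It is not the authors' implementation, which the abstract describes as C++ with MPI and Pthreads; Python is used here only for brevity. The event schema `(timestamp_ms, key, value)`, the function name `windowed_sums`, and the window length are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event schema (an assumption, not taken from the paper):
# (timestamp in milliseconds, grouping key, numeric value)
Event = Tuple[int, str, int]


def windowed_sums(events: Iterable[Event], window_ms: int) -> Dict[Tuple[int, str], int]:
    """Sum values per (window_start, key) over tumbling windows of window_ms."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    sums: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp_ms, key, value in events:
        # Each event is assigned to a window by its own timestamp, so events
        # are consumed one at a time and no batch boundary has to be awaited.
        window_start = (timestamp_ms // window_ms) * window_ms
        sums[(window_start, key)] += value
    return dict(sums)


if __name__ == "__main__":
    sample = [(0, "a", 1), (400, "b", 2), (999, "a", 3), (1000, "a", 4), (1500, "b", 5)]
    for (start, key), total in sorted(windowed_sums(sample, 1000).items()):
        print(start, key, total)
```

In a distributed setting of the kind the abstract describes, each worker would presumably hold such partial sums for its share of the stream and exchange them with its peers; that communication layer is deliberately left out of this sketch.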
