Benchmarking Synchronous and Asynchronous Stream Processing Systems
V. E. Venugopal, M. Theobald
Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, 2020-01-05
DOI: 10.1145/3371158.3371206 (https://doi.org/10.1145/3371158.3371206)
With the recent advancements in Big Data and Internet-of-Things (IoT) applications, we observe continued growth in the volume of streaming data produced by sensor and social networks, broadcasting systems, e-commerce, and many other sources. Even though Big Data platforms such as Apache Hadoop [12], Spark [13], Storm [11] and Kafka [9] would serve the purpose, their underlying batch mode of operation makes it necessary to first split the incoming data streams into batches, and then to synchronously execute a given analytical workflow over these batches. To overcome the limitations of these synchronous stream-processing architectures, asynchronous stream-processing (ASP) engines such as Apache Flink [1], Samza [10] and Naiad [7, 8] have recently emerged. Although the asynchronous handling of streams is reported to be the prime reason for the performance gains of ASP engines (in terms of sustainable throughput [4, 6] and per-window latencies), we believe that their architectural similarity to the original design of Hadoop has not yet been investigated critically enough. Given the inherent variance of distributed computations (due to communication and network delays, scheduling algorithms, time spent on processing, serialization/deserialization, etc.), the performance of platforms built on a master-client architecture is still often bound by hidden synchronization barriers and the constant need for state exchange (and hence communication) between the master and the client nodes. To understand the upper bound of the maximum sustainable throughput [5] that is possible for a given node configuration, we have designed multiple hard-coded multi-threaded processes (called ad-hoc dataflows) in C++ using the Message Passing Interface (MPI) and Pthread libraries, for two use-cases, namely the Yahoo! Streaming Benchmark (YSB) [2] and Simple Windowed Aggregation (SWA), such that they can collectively process an input stream based on the logic of the use-case. Once deployed, these dataflows can communicate asynchronously with each other to perform the use-case-specific operations with 100% accuracy. The performance of these light-weight ad-hoc dataflows is compared against the main competitors among the stream data processing
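
To make the windowed-aggregation use-case named in the abstract concrete, the following is a minimal single-threaded sketch of a tumbling-window count per key. It is not the authors' ad-hoc dataflow (which the abstract describes as multi-threaded and MPI-based); the `Event` type, the `aggregate_tumbling` function, and the choice of counting as the aggregate are all assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event type: a key plus an event-time timestamp in milliseconds.
struct Event {
    std::string key;
    std::int64_t timestamp_ms;
};

// Maps (window start, key) to the number of events seen for that pair.
using WindowCounts =
    std::map<std::pair<std::int64_t, std::string>, std::int64_t>;

// Assigns every event to the tumbling window that contains its timestamp
// and counts events per (window, key). Assumes non-negative timestamps.
WindowCounts aggregate_tumbling(const std::vector<Event>& events,
                                std::int64_t window_ms) {
    assert(window_ms > 0);
    WindowCounts counts;
    for (const Event& e : events) {
        // Integer division rounds the timestamp down to its window start.
        const std::int64_t window_start =
            (e.timestamp_ms / window_ms) * window_ms;
        ++counts[std::make_pair(window_start, e.key)];
    }
    return counts;
}
```

Calling `aggregate_tumbling` with a window length of 1000 ms on events at timestamps 0, 500 and 1000 for the same key places the first two in the window starting at 0 and the third in the window starting at 1000. A distributed variant would additionally have to partition keys across processes and decide when a window is complete, which this sketch deliberately leaves out.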