A configurable and executable model of Spark Streaming on Apache YARN

Int. J. Grid Util. Comput. Pub Date : 2020-02-03 DOI:10.1504/ijguc.2020.10026548

Jia-Chun Lin, Ming-Chang Lee, Ingrid Chieh Yu, E. Johnsen

Citations: 8

Abstract

Streams of data are produced today at an unprecedented scale. Efficient and stable processing of these streams requires a careful interplay between the parameters of the streaming application and those of the underlying stream processing framework. Today, these parameters are found by trial and error on the complex, deployed framework. This paper shows that high-level models can help to determine these parameters by predicting and comparing the performance of streaming applications running on stream processing frameworks with different configurations. To demonstrate this approach, this paper considers Spark Streaming, a widely used framework to leverage data streams on the fly and provide real-time stream processing. Technically, we develop a configurable and executable model to simulate both the streaming applications and the underlying Spark stream processing framework. Furthermore, we model the deployment of Spark Streaming on Apache YARN, which is a popular open-source distributed software framework for big data processing. Empirical validation shows that the developed model predicts performance with satisfactory accuracy.
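To make the idea of comparing configurations on a model rather than on a deployed cluster concrete, here is a minimal toy sketch in Python. It is not the model developed in the paper. It only assumes a well-known property of micro-batch processing: a configuration is stable when each batch is processed within one batch interval. The names ToyConfig, simulate, and is_stable, the constant arrival rate, and the linear cost formula are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyConfig:
    """Hypothetical knobs; none of these names come from the paper."""
    batch_interval_s: float   # how often a new micro-batch is cut
    arrival_rate: float       # records per second (assumed constant)
    cores: int                # total cores available to the application
    cost_per_record_s: float  # assumed cost of one record on one core
    fixed_overhead_s: float   # assumed per-batch scheduling overhead


def simulate(cfg: ToyConfig, num_batches: int) -> List[float]:
    """Return the scheduling delay (seconds) of each micro-batch.

    Batches are processed one at a time in arrival order; a batch waits
    if the previous one has not finished when it becomes ready.
    """
    records = cfg.arrival_rate * cfg.batch_interval_s
    # Assumed linear cost model: overhead plus perfectly parallel work.
    processing_time = cfg.fixed_overhead_s + records * cfg.cost_per_record_s / cfg.cores

    delays = []
    free_at = 0.0  # time at which the previous batch finishes
    for i in range(num_batches):
        ready_at = (i + 1) * cfg.batch_interval_s  # batch i closes here
        start = max(ready_at, free_at)
        delays.append(start - ready_at)
        free_at = start + processing_time
    return delays


def is_stable(cfg: ToyConfig, num_batches: int = 50) -> bool:
    """A configuration is treated as stable if the delay does not grow."""
    delays = simulate(cfg, num_batches)
    return delays[-1] <= delays[0] + 1e-9


# Two candidate configurations that differ only in the number of cores.
four_cores = ToyConfig(1.0, 1000.0, 4, 0.002, 0.1)
two_cores = ToyConfig(1.0, 1000.0, 2, 0.002, 0.1)
```

Under these assumed numbers, is_stable(four_cores) holds because each batch takes 0.6 s of a 1 s interval, while two_cores needs 1.1 s per batch and its scheduling delay grows with every batch. A realistic model would also have to capture the application's stages and the resource manager's behaviour, which this sketch deliberately ignores.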

Source journal

Int. J. Grid Util. Comput.

Self-citation rate

0.00%

Articles published