Quick Execution Time Predictions for Spark Applications

Sarah Shah, Yasaman Amannejad, Diwakar Krishnamurthy, Mea Wang

2019 15th International Conference on Network and Service Management (CNSM), October 2019
DOI: 10.23919/CNSM46954.2019.9012752
Citations: 15
Abstract
The Apache Spark cluster computing platform is increasingly used to develop big data analytics applications. Many scenarios require quick estimates of the execution time of a given Spark application. For example, users and operators of a Spark cluster often require quick insight into how the execution time of an application is likely to be impacted by the resources allocated to it, e.g., the number of Spark executor cores assigned and the size of the data to be processed. Job schedulers can benefit from fast estimates at runtime that allow them to quickly configure a Spark application for a desired execution time using the least amount of resources. While others have developed models to predict the execution time of Spark applications, such models typically require extensive prior executions of applications under various resource allocation settings and data sizes. Consequently, these techniques are not suited to situations where quick predictions are required and very few cluster resources are available for the experimentation needed to build a model. This paper proposes an alternative approach, called PERIDOT, that addresses this limitation. The approach executes a given application under a fixed resource allocation setting with two different-sized, small subsets of its input data. It analyzes logs from these two executions to estimate the dependencies between internal stages in the application. Information on these dependencies, combined with knowledge of Spark’s data partitioning mechanisms, is used to derive an analytic model that can predict execution times for other resource allocation settings and input data sizes. We show that deriving a model from just these two reference executions allows PERIDOT to accurately predict the performance of a variety of Spark applications spanning text analytics, linear algebra, machine learning and Spark SQL. In contrast, we show that a state-of-the-art machine-learning-based execution time prediction algorithm performs poorly when presented with such limited training data.
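To illustrate the flavor of a two-reference-run prediction model, the sketch below fits a linear cost model from two small executions at a fixed core count and then extrapolates to other data sizes and core counts using Spark's wave-based task scheduling (tasks run in waves of at most `cores` parallel tasks per stage). This is a minimal, hypothetical sketch under strong simplifying assumptions (linear per-unit stage cost, perfectly even partitions, a single aggregate stage), not PERIDOT's actual model; all function names and numbers are illustrative.

```python
import math

def fit_reference_model(size1, time1, size2, time2):
    """Fit time = fixed + per_unit * size from two reference runs
    executed with the same resource allocation (hypothetical model)."""
    per_unit = (time2 - time1) / (size2 - size1)
    fixed = time1 - per_unit * size1
    return fixed, per_unit

def predict_time(fixed, per_unit, data_size, num_partitions,
                 ref_cores, target_cores):
    """Extrapolate to a new data size and executor core count.
    Spark schedules a stage's tasks in waves of at most `cores`
    parallel tasks, so the variable cost is scaled by the ratio of
    wave counts; assumes evenly sized partitions."""
    ref_waves = math.ceil(num_partitions / ref_cores)
    target_waves = math.ceil(num_partitions / target_cores)
    variable = per_unit * data_size * (target_waves / ref_waves)
    return fixed + variable

# Two small reference runs with 2 executor cores: 1 GB in 30 s, 2 GB in 50 s.
fixed, per_unit = fit_reference_model(size1=1.0, time1=30.0,
                                      size2=2.0, time2=50.0)
# Predict the time for 10 GB of input split into 64 partitions on 8 cores.
print(predict_time(fixed, per_unit, data_size=10.0,
                   num_partitions=64, ref_cores=2, target_cores=8))
```

The ceiling in the wave count is what makes such a model sensitive to the interaction between partition count and core count: adding cores only helps until the last wave is no longer full, which a purely linear speedup model would miss.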