Quick Execution Time Predictions for Spark Applications

Sarah Shah, Yasaman Amannejad, Diwakar Krishnamurthy, Mea Wang

2019 15th International Conference on Network and Service Management (CNSM), October 2019
DOI: 10.23919/CNSM46954.2019.9012752
Citations: 15
Abstract
The Apache Spark cluster computing platform is increasingly used to develop big data analytics applications. Many scenarios require quick estimates of the execution time of a given Spark application. For example, users and operators of a Spark cluster often require quick insight into how the execution time of an application is likely to be impacted by the resources allocated to it, e.g., the number of Spark executor cores assigned and the size of the data to be processed. Job schedulers can benefit from fast estimates at runtime that allow them to quickly configure a Spark application for a desired execution time using the least amount of resources. While others have developed models to predict the execution time of Spark applications, such models typically require extensive prior executions of applications under various resource allocation settings and data sizes. Consequently, these techniques are not suited to situations where quick predictions are required and very few cluster resources are available for the experimentation needed to build a model. This paper proposes an alternative approach, called PERIDOT, that addresses this limitation. The approach executes a given application under a fixed resource allocation setting with two different-sized, small subsets of its input data. It analyzes logs from these two executions to estimate the dependencies between internal stages in the application. Information on these dependencies, combined with knowledge of Spark’s data partitioning mechanisms, is used to derive an analytic model that can predict execution times for other resource allocation settings and input data sizes. We show that deriving a model from just these two reference executions allows PERIDOT to accurately predict the performance of a variety of Spark applications spanning text analytics, linear algebra, machine learning and Spark SQL. In contrast, we show that a state-of-the-art machine-learning-based execution time prediction algorithm performs poorly when presented with such limited training data.
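To illustrate the flavor of a two-reference-run prediction model, the sketch below fits a linear cost model from two small executions at a fixed core count and then extrapolates to other data sizes and core counts using Spark's wave-based task scheduling (tasks run in waves of at most `cores` parallel tasks per stage). This is a minimal, hypothetical sketch under strong simplifying assumptions (linear per-unit stage cost, perfectly even partitions, a single aggregate stage), not PERIDOT's actual model; all function names and numbers are illustrative.

```python
import math

def fit_reference_model(size1, time1, size2, time2):
    """Fit time = fixed + per_unit * size from two reference runs
    executed with the same resource allocation (hypothetical model)."""
    per_unit = (time2 - time1) / (size2 - size1)
    fixed = time1 - per_unit * size1
    return fixed, per_unit

def predict_time(fixed, per_unit, data_size, num_partitions,
                 ref_cores, target_cores):
    """Extrapolate to a new data size and executor core count.
    Spark schedules a stage's tasks in waves of at most `cores`
    parallel tasks, so the variable cost is scaled by the ratio of
    wave counts; assumes evenly sized partitions."""
    ref_waves = math.ceil(num_partitions / ref_cores)
    target_waves = math.ceil(num_partitions / target_cores)
    variable = per_unit * data_size * (target_waves / ref_waves)
    return fixed + variable

# Two small reference runs with 2 executor cores: 1 GB in 30 s, 2 GB in 50 s.
fixed, per_unit = fit_reference_model(size1=1.0, time1=30.0,
                                      size2=2.0, time2=50.0)
# Predict the time for 10 GB of input split into 64 partitions on 8 cores.
print(predict_time(fixed, per_unit, data_size=10.0,
                   num_partitions=64, ref_cores=2, target_cores=8))
```

The ceiling in the wave count is what makes such a model sensitive to the interaction between partition count and core count: adding cores only helps until the last wave is no longer full, which a purely linear speedup model would miss.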