On Machine Learning-based Stage-aware Performance Prediction of Spark Applications
Guangjun Ye, Wuji Liu, C. Wu, Wei Shen, Xukang Lyu
2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC), published 2020-11-06
DOI: 10.1109/IPCCC50635.2020.9391564 (https://doi.org/10.1109/IPCCC50635.2020.9391564)
Abstract
The data volume of large-scale applications in science, engineering, and business has grown explosively over the past decade and now far exceeds the computing capability and storage capacity of any single server. Such data is therefore often stored in distributed file systems and processed by parallel computing engines such as Spark, which has gained popularity over the traditional MapReduce framework due to its fast in-memory processing of streaming data. Spark engines are generally deployed in cloud environments such as Amazon EC2 and Alibaba Cloud. Because storage and computing resources in these environments are typically provisioned on a pay-as-you-go basis, an accurate estimate of the execution time of Spark workloads is critical to fully utilizing cloud resources and meeting the performance requirements of end users. Our insight is that the execution patterns of many Spark workloads are qualitatively similar, which makes it possible to leverage historical performance data to predict the execution time of a given Spark application. We use execution information extracted from the Spark History Server as training data and develop a stage-aware hierarchical neural network model for performance prediction. Experimental results show that the proposed hierarchical model achieves higher end-to-end accuracy than a holistic prediction model and also outperforms existing regression-based prediction methods.
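The stage-aware idea in the abstract can be illustrated with a small sketch: fit one predictor per stage from historical runs, then combine the per-stage predictions into an application-level estimate. This is not the authors' model. The abstract does not specify the network architecture or the input features, so the sketch substitutes a one-variable least-squares line for each stage's neural network, uses synthetic history in place of Spark History Server records, and assumes stages run one after another so that their times can be summed. All function names and the `(stage_id, input_mb, seconds)` record layout are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All inputs identical: fall back to predicting the mean time.
        return mean_y, 0.0
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = cov_xy / var_x
    return mean_y - b * mean_x, b


def train_stage_models(history):
    """Fit one line per stage id.

    history: list of runs; each run is a list of
    (stage_id, input_mb, seconds) tuples (hypothetical layout).
    """
    samples = defaultdict(lambda: ([], []))
    for run in history:
        for stage_id, input_mb, seconds in run:
            xs, ys = samples[stage_id]
            xs.append(input_mb)
            ys.append(seconds)
    return {sid: fit_line(xs, ys) for sid, (xs, ys) in samples.items()}


def predict_app_time(models, stage_inputs):
    """Sum per-stage predictions (assumes stages execute sequentially)."""
    total = 0.0
    for stage_id, input_mb in stage_inputs:
        a, b = models[stage_id]
        total += a + b * input_mb
    return total


# Synthetic history of three runs of a two-stage application.
history = [
    [(0, 100, 12.0), (1, 40, 9.0)],
    [(0, 200, 22.0), (1, 80, 17.0)],
    [(0, 400, 42.0), (1, 160, 33.0)],
]
models = train_stage_models(history)
predicted = predict_app_time(models, [(0, 300), (1, 120)])
print(f"predicted application time: {predicted:.1f} s")
```

With this synthetic data the per-stage fits are exact, and the estimate for the new run comes to about 57 seconds. A holistic model, by contrast, would fit a single predictor from application-level features to total time without looking at individual stages.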