Performance Modeling and Prediction of Big Data Workflows: An Exploratory Analysis

2020 29th International Conference on Computer Communications and Networks (ICCCN). Pub Date: 2020-08-01. DOI: 10.1109/ICCCN49398.2020.9209715

Wuji Liu, C. Wu, Qianwen Ye, Aiqin Hou, Wei Shen

Citations: 1

Abstract

Many next-generation scientific and business applications feature large-scale data-intensive workflows, which require massive computing resources for execution on high-performance clusters in cloud environments. Such computing resources (e.g., VCores and virtual memory) requested through parameter setting in big data systems, if not fully utilized by workloads, are simply wasted due to the nature of exclusive access made possible by containerization. This necessitates accurate modeling and prediction of workflow performance to make an effective recommendation of appropriate parameter settings to end users. However, it is challenging to determine optimal workflow and system configurations due to the large parameter space and the interaction between various technology layers of big data systems. Towards this goal, we propose a machine learning-based feature selection method to identify influential parameters based on historical performance measurements of Spark-based computing workloads executed in big data systems with YARN. We first identify a comprehensive set of parameters across multiple layers in the big data technology stack including workflow input structure, Spark computing engine, and YARN resource management. We then conduct an in-depth exploratory analysis of their individual and coupled impact on workflow performance, and develop a performance-influence model using random forest for prediction. Experimental results show that the proposed approach identifies important features for performance modeling and achieves high accuracy in performance prediction.
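
The abstract describes ranking configuration parameters by their influence on workflow performance and predicting performance with a random forest. The following is a minimal sketch of that general idea, not the authors' implementation: it grows a small random forest of regression trees in plain Python, scores each feature by the variance reduction its splits contribute, and predicts runtime by averaging the trees. The feature names, the synthetic runtime formula, and all hyperparameters are assumptions made purely for illustration; the paper's actual parameter set and measurements are not reproduced here.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(X, y, rng, gains, depth=0, max_depth=4, min_leaf=3):
    """Grow one regression tree, adding each split's SSE reduction to gains."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return sum(y) / len(y)  # leaf: mean runtime of the samples that reached it
    n_features = len(X[0])
    parent_sse = sse(y)
    best = None
    # A random forest considers only a random subset of features at each split.
    for f in rng.sample(range(n_features), max(1, n_features // 2)):
        for t in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if min(len(left), len(right)) < min_leaf:
                continue
            gain = parent_sse - sse([y[i] for i in left]) - sse([y[i] for i in right])
            if best is None or gain > best[0]:
                best = (gain, f, t, left, right)
    if best is None or best[0] <= 0:
        return sum(y) / len(y)
    gain, f, t, left, right = best
    gains[f] += gain
    children = [
        build_tree([X[i] for i in part], [y[i] for i in part],
                   rng, gains, depth + 1, max_depth, min_leaf)
        for part in (left, right)
    ]
    return (f, t, children[0], children[1])

def fit_forest(X, y, n_trees=30, seed=0):
    """Fit bootstrapped trees; return them with normalized feature importances."""
    rng = random.Random(seed)
    gains = [0.0] * len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng, gains))
    total = sum(gains) or 1.0
    return trees, [g / total for g in gains]

def predict(trees, row):
    """Average the per-tree predictions for one configuration."""
    total = 0.0
    for node in trees:
        while isinstance(node, tuple):  # internal nodes are tuples, leaves are floats
            f, t, left, right = node
            node = left if row[f] <= t else right
        total += node
    return total / len(trees)

# Hypothetical features and a synthetic runtime model (assumptions, not paper data).
FEATURES = ["executor_cores", "executor_memory_gb", "input_size_gb", "unrelated_knob"]
data_rng = random.Random(42)
X, y = [], []
for _ in range(200):
    cores = data_rng.choice([1, 2, 4, 8])
    mem = data_rng.choice([2, 4, 8, 16])
    size = data_rng.choice(range(10, 101, 10))
    knob = data_rng.randrange(10)
    X.append([cores, mem, size, knob])
    y.append(20.0 * size / cores + data_rng.gauss(0, 5))  # runtime ignores mem and knob

trees, importance = fit_forest(X, y)
for name, score in sorted(zip(FEATURES, importance), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

In this toy setup, the two features that actually drive the synthetic runtime should rank above the two that do not, which mirrors the kind of screening the abstract attributes to feature selection. In practice a library implementation (for example a random forest regressor from an established machine learning package) trained on real historical measurements would replace this hand-rolled version.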
