Performance Prediction for Data-driven Workflows on Apache Spark

Andrea Gulino, Arif Canakoglu, S. Ceri, D. Ardagna
Published in: 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Publication date: 2020-11-17
DOI: https://doi.org/10.1109/MASCOTS50786.2020.9285944
Citations: 5

Abstract

Spark is an in-memory framework for implementing distributed applications of various types. Predicting the execution time of Spark applications is an important but challenging problem that several studies have tackled in the past few years; most of them achieve good prediction accuracy on simple applications (e.g., known machine learning (ML) algorithms or SQL-based applications). In this work, we consider complex data-driven workflow applications, in which the execution and data flow can be modeled by Directed Acyclic Graphs (DAGs). Workflows can be made of an arbitrary combination of known tasks, each applying a set of Spark operations to its input data. By adopting a hybrid approach that combines analytical and ML models trained on small DAGs, we can predict, with good accuracy, the execution time of unseen workflows of greater complexity and size. We validate our approach through extensive experiments on real-world complex applications, comparing different ML models and choices of feature sets.
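
To make the idea of a hybrid predictor concrete, the following is a minimal sketch and not the authors' actual method, which the abstract does not detail. It assumes that each known task type has a learned model (here a simple least-squares line of time versus input size, standing in for an ML model) and that the analytical part composes per-task predictions along the DAG's critical path, assuming independent branches run fully in parallel. All names (`fit_linear`, `predict_workflow_time`, the task types, and the measurements) are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def fit_linear(samples):
    """Least-squares fit of time = a * size + b from (size, time) pairs.

    Stand-in for the learned per-task model (an assumption of this sketch).
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def predict_workflow_time(dag, task_types, input_sizes, models):
    """Compose per-task predictions along the critical path of a DAG.

    dag maps each node to the set of nodes it depends on. The analytical
    assumption here is that a task starts as soon as all of its
    predecessors have finished (unlimited parallelism).
    """
    finish = {}
    for node in TopologicalSorter(dag).static_order():
        a, b = models[task_types[node]]
        duration = a * input_sizes[node] + b
        start = max((finish[p] for p in dag.get(node, ())), default=0.0)
        finish[node] = start + duration
    return max(finish.values())


# Hypothetical training measurements gathered on small workflows.
models = {
    "filter": fit_linear([(1, 2.0), (2, 3.0), (3, 4.0)]),
    "join": fit_linear([(1, 3.0), (2, 5.0), (4, 9.0)]),
}

# An unseen workflow: two filters feeding one join.
dag = {"A": set(), "B": set(), "C": {"A", "B"}}
task_types = {"A": "filter", "B": "filter", "C": "join"}
input_sizes = {"A": 4, "B": 2, "C": 3}

predicted = predict_workflow_time(dag, task_types, input_sizes, models)
print(f"predicted execution time: {predicted:.2f}")
```

The critical-path rule is only one possible analytical composition; on a real cluster, resource contention between parallel branches would usually make it an optimistic estimate.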