GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows

T. Hegeman, Matthijs Jansen, A. Iosup, A. Trivedi
{"title":"GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows","authors":"T. Hegeman, Matthijs Jansen, A. Iosup, A. Trivedi","doi":"10.1145/3447545.3451185","DOIUrl":null,"url":null,"abstract":"Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model-training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow, PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also been put into performance analysis tools focusing on ML model-training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. Therefore, it is surprising that there exists no performance analysis tool that covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing performance tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration in ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.","PeriodicalId":10596,"journal":{"name":"Companion of the 2018 ACM/SPEC International Conference on Performance Engineering","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion of the 2018 ACM/SPEC International Conference on Performance Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447545.3451185","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model-training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow, PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also been put into performance analysis tools focusing on ML model-training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. Therefore, it is surprising that there exists no performance analysis tool that covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing performance tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration in ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.
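To make the four pillars slightly more concrete, below is a minimal, hypothetical Python sketch, not taken from the paper or the GradeML repository, of what a unified execution model with lightweight performance-data collection could look like: nested workflow phases (preprocessing, training, evaluation) form a tree, each phase records its wall-clock duration, and the results are aggregated into a per-phase report. All names here (PhaseTracer, phase, report) are illustrative assumptions, not GradeML's actual API.

```python
# Hypothetical sketch of phase-level instrumentation for an ML workflow.
# Nested phases form a tree (a stand-in for a unified execution model);
# each phase records lightweight timing data that is later aggregated
# and presented per phase.
import time
from contextlib import contextmanager


class PhaseTracer:
    """Records wall-clock durations for nested workflow phases."""

    def __init__(self):
        self._stack = []      # currently open phases, e.g. ["workflow", "train"]
        self.durations = {}   # phase path -> accumulated seconds

    @contextmanager
    def phase(self, name):
        # Open a (possibly nested) phase and time everything inside it.
        self._stack.append(name)
        path = "/".join(self._stack)
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[path] = self.durations.get(path, 0.0) + elapsed
            self._stack.pop()

    def report(self):
        # Aggregate and present: one line per phase, indented by nesting depth.
        for path, seconds in sorted(self.durations.items()):
            depth = path.count("/")
            print(f"{'  ' * depth}{path.rsplit('/', 1)[-1]}: {seconds:.3f}s")


tracer = PhaseTracer()
with tracer.phase("workflow"):
    with tracer.phase("preprocess"):
        time.sleep(0.05)   # stand-in for data preparation
    with tracer.phase("train"):
        time.sleep(0.10)   # stand-in for model training
    with tracer.phase("evaluate"):
        time.sleep(0.02)   # stand-in for validation
tracer.report()
```

The context-manager pattern keeps the instrumentation lightweight and non-intrusive, which is in the spirit of the pillars on lightweight data collection and close integration with ML systems; a real tool would of course also capture hardware counters, framework-level events, and distributed-execution metadata.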