I/O performance analysis of machine learning workloads on leadership scale supercomputer

IF 1, Zone 4 Computer Science, Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Performance Evaluation Pub Date : 2022-10-01 DOI:10.1016/j.peva.2022.102318

Ahmad Maroof Karimi, Arnab K. Paul, Feiyi Wang

Citations: 1

Abstract

The popularity of machine learning (ML) technologies and frameworks has led to a growing number of ML workloads running on high-performance computing (HPC) clusters. ML workflows are being readily adopted in diverse computational fields such as Biology, Physics, Materials, and Computer Science. The I/O behavior of emerging ML workloads differs distinctly from that of traditional HPC workloads, such as simulation or checkpoint/restart-based I/O. ML workloads have also pushed for the use of GPUs, or a combination of CPUs and GPUs, in addition to CPU-only computation. The diverse and complex I/O behavior of ML workloads requires extensive study, because it is critical to the efficient performance of the various layers of the I/O stack and to the overall performance of HPC workloads. This work aims to fill the gap in understanding the I/O behavior of emerging ML workloads by providing an in-depth analysis of ML jobs running on large-scale leadership HPC systems. In particular, we analyze job behavior by job scale, science domain, and the processing units used by the ML jobs. The analysis was performed on 23,000 ML jobs collected from one year of Darshan logs on Summit, one of the fastest supercomputers. We also collect the CPU and GPU usage of 15,165 ML jobs by merging the Darshan dataset with the power usage of the processing units on Summit. This paper therefore provides a systematic I/O characterization of ML workloads on a leadership-scale HPC machine, both to understand how I/O behavior differs across science domains, workload scales, and processing units, and to analyze how ML I/O workloads use the parallel file system and burst buffer. We make several observations regarding I/O performance and access patterns through various analytical studies, and we discuss the important lessons learned from the perspective of an ML user and a storage architect for emerging ML workloads running on large-scale supercomputers.
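The abstract describes merging per-job Darshan I/O records with processing-unit power usage and then comparing I/O behavior across groups. A minimal sketch of that kind of join is shown below. It is an illustration only, not the authors' pipeline: every field name, the sample values, and the 100 W activity thresholds are assumptions invented for the example.

```python
from collections import defaultdict

# Hypothetical per-job I/O summaries, standing in for values one might
# extract from Darshan logs. Field names and numbers are illustrative only.
darshan_jobs = [
    {"job_id": 101, "domain": "Biology", "bytes_read": 8_000, "bytes_written": 2_000},
    {"job_id": 102, "domain": "Physics", "bytes_read": 1_000, "bytes_written": 9_000},
    {"job_id": 103, "domain": "Biology", "bytes_read": 4_000, "bytes_written": 4_000},
]

# Hypothetical per-job mean power draw in watts, keyed by job id.
# Job 103 has no power record and is dropped by the join.
power_usage = {
    101: {"cpu_w": 50.0, "gpu_w": 900.0},
    102: {"cpu_w": 300.0, "gpu_w": 5.0},
}

# Assumed thresholds for calling a processing unit "in use".
CPU_ACTIVE_W = 100.0
GPU_ACTIVE_W = 100.0


def classify_units(cpu_w, gpu_w):
    """Label a job by which processing units its power draw suggests it used."""
    cpu_active = cpu_w >= CPU_ACTIVE_W
    gpu_active = gpu_w >= GPU_ACTIVE_W
    if cpu_active and gpu_active:
        return "CPU+GPU"
    if gpu_active:
        return "GPU"
    # Falls back to "CPU" when the GPU is not active (including when
    # neither threshold is reached).
    return "CPU"


def merge_jobs(jobs, power):
    """Inner join on job_id: keep only jobs that have a power record."""
    merged = []
    for job in jobs:
        record = power.get(job["job_id"])
        if record is None:
            continue
        merged.append({**job, "units": classify_units(record["cpu_w"], record["gpu_w"])})
    return merged


def read_fraction_by(merged, key):
    """Fraction of transferred bytes that were reads, grouped by `key`."""
    totals = defaultdict(lambda: [0, 0])  # group -> [bytes read, bytes total]
    for job in merged:
        totals[job[key]][0] += job["bytes_read"]
        totals[job[key]][1] += job["bytes_read"] + job["bytes_written"]
    return {group: read / total for group, (read, total) in totals.items() if total > 0}


merged = merge_jobs(darshan_jobs, power_usage)
by_units = read_fraction_by(merged, "units")
by_domain = read_fraction_by(merged, "domain")
```

An inner join keeps only the jobs present in both datasets, which may be why the abstract reports CPU and GPU usage for a subset (15,165) of the 23,000 analyzed jobs. The same grouping helper works for any per-job attribute, such as science domain or job scale.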

Source journal

Performance Evaluation (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore

3.10

Self-citation rate

0.00%

Articles published

Review time

24 days

Journal description: Performance Evaluation functions as a leading journal in the area of modeling, measurement, and evaluation of performance aspects of computing and communication systems. As such, it aims to present a balanced and complete view of the entire Performance Evaluation profession. Hence, the journal is interested in papers that focus on one or more of the following dimensions:
- Define new performance evaluation tools, including measurement and monitoring tools as well as modeling and analytic techniques
- Provide new insights into the performance of computing and communication systems
- Introduce new application areas where performance evaluation tools can play an important role and creative new uses for performance evaluation tools.
More specifically, common application areas of interest include the performance of:
- Resource allocation and control methods and algorithms (e.g. routing and flow control in networks, bandwidth allocation, processor scheduling, memory management)
- System architecture, design and implementation
- Cognitive radio
- VANETs
- Social networks and media
- Energy efficient ICT
- Energy harvesting
- Data centers
- Data centric networks
- System reliability
- System tuning and capacity planning
- Wireless and sensor networks
- Autonomic and self-organizing systems
- Embedded systems
- Network science