Monitoring and Characterizing GPU Usage

IF 1.5 · CAS Zone 4 (Computer Science) · JCR Q3 (Computer Science, Software Engineering)

Concurrency and Computation-Practice & Experience Pub Date : 2025-01-16 DOI:10.1002/cpe.8341

Le Mai Weakley, Scott Michael, Laura Huber, Abhinav Thota, Ben Fulton, Matthew Kusz



Abstract

For systems with an accelerator component, it is important from an operational and planning perspective to understand how and to what extent the accelerators are being used. A framework for tracking the utilization of accelerator resources matters both for judging how efficiently a system is used and for capacity and configuration planning of future systems. Beyond total utilization and accelerator efficiency numbers, attention should also be paid to the types of research and workflows being executed on the system. In the past, demand for accelerator resources was driven largely by traditional simulation codes, such as molecular dynamics. With the growing popularity of deep learning and artificial intelligence workflows, accelerators have become even more highly sought after and are being used in new ways. Provisioning resources to researchers via an allocation system allows sites to track a project's usage and workflow as well as its scientific impact. With such tools and data in hand, it becomes possible to characterize the GPU utilization of deep learning frameworks versus more traditional GPU-enabled applications. In this paper we present a survey of GPU monitoring tools used at sites and a framework, used at Indiana University, for tracking the utilization of NVIDIA GPUs on Slurm-scheduled HPC systems. We also present an analysis of accelerator utilization on multiple systems, including an HPE Apollo system targeting AI workflows and a Cray EX system.
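To make the idea of tracking GPU utilization concrete, the following is a minimal sketch and not the framework described in the paper. It assumes utilization samples are collected with `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits` and averages them per GPU. The function names and the sample data are hypothetical, and associating samples with Slurm jobs is left out.

```python
import csv
import io
from collections import defaultdict

# Command that could produce the input on a node with an NVIDIA driver.
# It is listed for reference only; this sketch does not execute it.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_samples(text):
    """Parse csv,noheader,nounits lines into (gpu_index, util_percent, mem_mib)."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 3:
            continue  # skip blank or malformed lines
        rows.append((int(rec[0]), float(rec[1]), float(rec[2])))
    return rows


def mean_utilization(samples):
    """Return {gpu_index: mean utilization percent} over all samples."""
    totals = defaultdict(lambda: [0.0, 0])  # gpu_index -> [sum, count]
    for idx, util, _mem in samples:
        totals[idx][0] += util
        totals[idx][1] += 1
    return {idx: s / n for idx, (s, n) in totals.items()}


# Hypothetical samples: two polling rounds on a two-GPU node.
SAMPLE = """0, 90, 30000
1, 0, 5
0, 70, 31000
1, 10, 5
"""

if __name__ == "__main__":
    print(mean_utilization(parse_samples(SAMPLE)))
```

In this made-up data, GPU 1 is allocated but nearly idle, which is the kind of inefficiency a utilization-tracking framework is meant to surface.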


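Source journal: Concurrency and Computation-Practice & Experience (Engineering and Technology, Computer Science: Theory and Methods)

CiteScore: 5.00

Self-citation rate: 10.00%

Articles published: 664

Review time: 9.6 months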

Journal description: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.