Alexandre H.L. Porto, Micaella Coelho, Hiago M.G.A. Rocha, Carla Osthoff, Kary Ocaña, Douglas O. Cardoso
Title: Assuming the best: Towards a reliable protocol for resource usage prediction for high-performance computing based on machine learning
Journal: Future Generation Computer Systems-The International Journal of Escience, volume 175, Article 108070
Publication date: 2025-08-09
DOI: 10.1016/j.future.2025.108070
URL: https://www.sciencedirect.com/science/article/pii/S0167739X25003644
Impact factor: 6.2 (JCR Q1, Computer Science, Theory & Methods)
Citations: 0
Abstract
In High-Performance Computing (HPC) systems, multiple processes simultaneously consume resources such as CPU time, memory, and electrical power. Accurately predicting the resource consumption of a process from its execution parameters enables more efficient resource allocation, ultimately improving the overall performance of the HPC system. While many studies have explored this topic, fewer explicitly examine the assumptions underlying their approaches. This work helps fill that gap by proposing, experimenting with, and discussing a protocol for this problem, covering everything from the collection of process footprint data to the experimental evaluation of Machine Learning models based on such data. The protocol was assessed in a case study of the RAxML bioinformatics application on a real supercomputer. The reported results highlight not only its effectiveness (R² values greater than 0.9 were achieved in most tests) but also the reasonableness of the assumptions considered.
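For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how the coefficient of determination (R²) is conventionally computed from observed and predicted resource usage. The function name `r2_score` and the sample values are illustrative assumptions; they are not taken from the paper, and the authors' evaluation code may differ.

```python
from statistics import mean


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    avg = mean(observed)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the observed mean.
    ss_tot = sum((y - avg) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical run times in seconds (illustrative only, not data from the paper).
observed = [120.0, 250.0, 410.0, 800.0]
predicted = [130.0, 240.0, 400.0, 790.0]
score = r2_score(observed, predicted)
```

A value of 1 means the predictions match the observations exactly, while 0 means the model does no better than always predicting the mean of the observed values.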
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.