Assuming the best: Towards a reliable protocol for resource usage prediction for high-performance computing based on machine learning

Impact Factor 6.2 · Zone 2 (Computer Science) · JCR Q1: Computer Science, Theory & Methods

Future Generation Computer Systems-The International Journal of Escience · Pub Date: 2025-08-09 · DOI: 10.1016/j.future.2025.108070

Alexandre H.L. Porto, Micaella Coelho, Hiago M.G.A. Rocha, Carla Osthoff, Kary Ocaña, Douglas O. Cardoso

Volume 175, Article 108070 · Journal Article · https://www.sciencedirect.com/science/article/pii/S0167739X25003644

Abstract

In High-Performance Computing (HPC) systems, multiple processes simultaneously consume resources such as CPU time, memory, and electrical power, among others. Accurately predicting the resource consumption of a process based on its execution parameters enables more efficient resource allocation, ultimately improving the overall performance of the HPC system. While many studies have explored this topic, fewer explicitly examine the underlying assumptions of their approaches. This work contributes to filling that gap by proposing, experimenting with, and discussing a protocol to approach this problem, covering from the collection of processes footprint data to the experimental evaluation of Machine Learning models based on such data. The reported results of the assessment of this protocol in a case study of the RAxML bioinformatics application on a real supercomputer highlight not only its effectiveness ($R^2$ values greater than 0.9 were achieved in most tests) but also the reasonableness of the assumptions considered.
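
The effectiveness figure quoted above is the coefficient of determination, R^2 = 1 - SS_res / SS_tot, which compares a model's squared prediction errors with those of a baseline that always predicts the mean. The following is only a minimal sketch of how that score can be computed for resource-usage predictions. The function name `r_squared` and the run times are hypothetical illustrations, not the authors' code or data.

```python
from statistics import fmean


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    mean_obs = fmean(observed)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical wall-clock times in seconds for five runs (not from the paper).
observed = [120.0, 250.0, 380.0, 510.0, 640.0]
predicted = [130.0, 240.0, 390.0, 500.0, 650.0]
score = r_squared(observed, predicted)
print(f"R^2 = {score:.4f}")
```

A score of 1 means the predictions match the observations exactly, while a score of 0 means the model does no better than predicting the mean of the observed values.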

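Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months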

About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.