Assuming the best: Towards a reliable protocol for resource usage prediction for high-performance computing based on machine learning

IF 6.2 · CAS Zone 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS
Alexandre H.L. Porto, Micaella Coelho, Hiago M.G.A. Rocha, Carla Osthoff, Kary Ocaña, Douglas O. Cardoso
Journal: Future Generation Computer Systems-The International Journal of Escience, Volume 175, Article 108070
Publication date: 2025-08-09
Publication type: Journal Article
DOI: 10.1016/j.future.2025.108070
URL: https://www.sciencedirect.com/science/article/pii/S0167739X25003644
Citations: 0

Abstract

In High-Performance Computing (HPC) systems, multiple processes simultaneously consume resources such as CPU time, memory, and electrical power. Accurately predicting the resource consumption of a process from its execution parameters enables more efficient resource allocation, ultimately improving the overall performance of the HPC system. While many studies have explored this topic, fewer explicitly examine the assumptions underlying their approaches. This work helps fill that gap by proposing, experimenting with, and discussing a protocol for this problem, covering every step from the collection of process footprint data to the experimental evaluation of Machine Learning models built on such data. An assessment of this protocol in a case study of the RAxML bioinformatics application on a real supercomputer highlights not only its effectiveness (R² values greater than 0.9 were achieved in most tests) but also the reasonableness of the assumptions considered.
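To make the reported metric concrete, the following is a minimal sketch of how an R² score can be computed for a resource usage predictor evaluated on held-out runs. It is not the authors' protocol or model: the data are synthetic, the single "execution parameter" and the "CPU seconds" target are hypothetical, and a one-variable least-squares line stands in for whatever Machine Learning models the paper evaluates.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic, hypothetical data (NOT from the paper): one execution
# parameter per run and a noisy resource measurement, e.g. CPU seconds.
rng = random.Random(0)
params = [float(n) for n in range(1, 41)]
usage = [3.0 * n + 10.0 + rng.gauss(0.0, 2.0) for n in params]

# Hold out every fourth run for evaluation; train on the rest.
pairs = list(zip(params, usage))
train = [p for i, p in enumerate(pairs) if i % 4 != 0]
test = [p for i, p in enumerate(pairs) if i % 4 == 0]

a, b = fit_line([x for x, _ in train], [y for _, y in train])
predictions = [a * x + b for x, _ in test]
score = r2_score([y for _, y in test], predictions)
print(f"slope={a:.2f} intercept={b:.2f} held-out R^2={score:.3f}")
```

An R² of 1 means the predictions match the measurements exactly, while 0 means the model does no better than always predicting the mean of the measured values. Scoring on runs that were not used for fitting is one simple way to avoid an overly optimistic estimate.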


Source journal
CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months
About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.