Ensemble Prediction of Job Resources to Improve System Performance for Slurm-Based HPC Systems.

Pub Date: 2021-07-01. Epub Date: 2021-07-17. DOI: 10.1145/3437359.3465574
Mohammed Tanash, Daniel Andresen, Huichen Yang, William Hsu

Abstract

In this paper, we present a novel methodology for predicting job resources (memory and time) for jobs submitted to HPC systems. Our methodology is based on historical job data (sacct data) provided by the Slurm workload manager, using supervised machine learning. This machine learning (ML) prediction model is effective and useful for both HPC administrators and HPC users. Moreover, our ML model increases the efficiency and utilization of HPC systems, thereby reducing power consumption as well. Our approach applies several supervised discriminative models from the scikit-learn machine learning library, together with LightGBM, to historical data from Slurm. Our model helps HPC users determine the amount of resources their submitted jobs require, making it easier for them to use HPC resources efficiently. This work provides the second step toward implementing our general open-source tool for HPC service providers. For this work, our machine learning model has been implemented and tested on two HPC providers: an XSEDE service provider, the University of Colorado Boulder (RMACC Summit), and Kansas State University (Beocat). We used more than two hundred thousand jobs, one hundred thousand from Summit and one hundred thousand from Beocat, to train and assess our ML model. In particular, we measured the improvement in running time, turnaround time, and average waiting time for submitted jobs, as well as the utilization of the HPC clusters. Our model achieved up to 86% accuracy in predicting the amount of time and memory for both the Summit and Beocat HPC resources. Our results show that our model dramatically reduces the average computational waiting time (from 380 hours to 4 hours on RMACC Summit and from 662 hours to 28 hours on Beocat), reduces turnaround time (from 403 hours to 6 hours on RMACC Summit and from 673 hours to 35 hours on Beocat), and achieves up to 100% utilization on both HPC resources.
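The pipeline the abstract describes (fitting supervised scikit-learn regressors to historical Slurm accounting records so that requested resources can be right-sized) can be sketched as follows. This is a minimal illustrative sketch only: the synthetic sacct-style records, the feature choices, and the model settings are assumptions for demonstration, not the paper's actual data or configuration.

```python
# Hypothetical sketch of the approach: learn a mapping from a job's
# requested resources (as recorded by Slurm's sacct accounting) to the
# memory it actually used, using a scikit-learn ensemble regressor.
# All data below is synthetic; real training would parse sacct output.
import random
from sklearn.ensemble import GradientBoostingRegressor

random.seed(0)

# Synthetic sacct-style history: features are (ReqCPUS, ReqMem in MB,
# Timelimit in minutes); the target is peak memory actually used (MB).
X, y = [], []
for _ in range(500):
    cpus = random.randint(1, 32)
    req_mem = cpus * random.choice([1024, 2048, 4096])
    timelimit = random.choice([60, 240, 1440])
    # Jobs typically over-request: assume ~30% of requested memory is used.
    used_mem = 0.3 * req_mem + 50 * cpus + random.gauss(0, 100)
    X.append([cpus, req_mem, timelimit])
    y.append(max(used_mem, 0.0))

model = GradientBoostingRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Predict peak memory for a new 8-CPU job requesting 16 GB for 4 hours;
# a scheduler-facing tool could suggest this instead of the 16 GB request.
pred = model.predict([[8, 16384, 240]])[0]
print(f"predicted peak memory: {pred:.0f} MB")
```

In a real deployment the same idea would extend to predicting elapsed time, with candidate models (LightGBM among them) compared on held-out historical jobs before recommendations are surfaced to users.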
