Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning.

PEARC19 : Practice and Experience in Advanced Research Computing 2019 : Rise of the Machines (learning) : July 28-August 1, 2019, Chicago, Illinois. Practice and Experience in Advanced Research Computing (Conference) (2019 : Chicago, Il... Pub Date : 2019-07-01 Epub Date: 2019-07-28 DOI:10.1145/3332186.3333041

Mohammed Tanash, Brandon Dunn, Daniel Andresen, William Hsu, Huichen Yang, Adedolapo Okanlawon


Citations: 23

Abstract

High-Performance Computing (HPC) systems are resources used for data capture, sharing, and analysis. The majority of our HPC users come from disciplines other than Computer Science. HPC users, including computer scientists, have difficulty deciding how many resources their submitted jobs require on the cluster and do not feel proficient enough to do so. Consequently, users are encouraged to over-estimate the resources for their submitted jobs so that the jobs will not be killed due to insufficient resources. This practice wastes HPC resources and leads to inefficient cluster utilization. We created a supervised machine learning model and integrated it into the Slurm resource manager simulator to predict the amount of memory a job requires and the time required to run the computation. Our model uses several machine learning algorithms. Our goal is to integrate and test the proposed supervised machine learning model on Slurm. We used over 10,000 tasks selected from our HPC log files to evaluate the performance and accuracy of our integrated model. The purpose of our work is to increase the performance of Slurm, and thereby improve the utilization of the HPC system, by predicting the memory and the time required for each job with our integrated supervised machine learning model. Our results indicate that our model dramatically reduces computational turnaround time for large jobs (from five days to ten hours), substantially increases utilization of the HPC system, and decreases the average waiting time for submitted jobs.
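
The idea of predicting a job's memory and run time from past log entries can be illustrated with a small sketch. The abstract does not name the algorithms or features used, so everything below is an assumption made for illustration: the `JobRecord` fields, the `predict_usage` helper, the choice of k-nearest neighbours, and the safety margin are hypothetical and are not the authors' model or any part of Slurm.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class JobRecord:
    """One finished job from a hypothetical accounting log."""
    req_mem_mb: float     # memory the user requested
    req_time_min: float   # wall time the user requested
    used_mem_mb: float    # peak memory the job actually used
    used_time_min: float  # wall time the job actually used


def predict_usage(history, req_mem_mb, req_time_min, k=3, margin=1.2):
    """Predict (memory_mb, time_min) for a new request with k-nearest neighbours.

    Distance is Euclidean on the two requested values, each divided by the
    largest value seen in the history so that neither feature dominates.
    The prediction is the mean usage of the k closest past jobs multiplied
    by a safety margin, and never exceeds what the user asked for.
    """
    if not history:
        raise ValueError("history must contain at least one job")
    mem_scale = max(j.req_mem_mb for j in history) or 1.0
    time_scale = max(j.req_time_min for j in history) or 1.0

    def distance(job):
        return sqrt(((job.req_mem_mb - req_mem_mb) / mem_scale) ** 2
                    + ((job.req_time_min - req_time_min) / time_scale) ** 2)

    nearest = sorted(history, key=distance)[:k]
    mem = margin * sum(j.used_mem_mb for j in nearest) / len(nearest)
    minutes = margin * sum(j.used_time_min for j in nearest) / len(nearest)
    return min(mem, req_mem_mb), min(minutes, req_time_min)


# Made-up history: three users asked for 16 GB and 24 h but used far less.
example_history = [
    JobRecord(16000, 1440, 2000, 60),
    JobRecord(16000, 1440, 2400, 90),
    JobRecord(16000, 1440, 1600, 30),
    JobRecord(64000, 7200, 40000, 3000),
    JobRecord(4000, 60, 3500, 55),
]

if __name__ == "__main__":
    mem, minutes = predict_usage(example_history, 16000, 1440)
    print(f"suggested memory: {mem:.0f} MB, suggested time: {minutes:.0f} min")
```

With this made-up history, a new request for 16 GB and 24 hours is scaled down to roughly 2400 MB and 72 minutes, which is the kind of correction to over-estimated requests that the abstract describes. A real system would need richer features and a check against under-prediction, since a job given too little memory or time is killed.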

