Mohammed Tanash, Brandon Dunn, Daniel Andresen, William Hsu, Huichen Yang, Adedolapo Okanlawon
PEARC19: Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (learning), July 28-August 1, 2019, Chicago, Illinois.
Published 2019-07-01 (Epub 2019-07-28). DOI: https://doi.org/10.1145/3332186.3333041
Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning.
High-Performance Computing (HPC) systems are resources used for data capture, sharing, and analysis. The majority of our HPC users come from disciplines other than Computer Science. HPC users, including computer scientists, have difficulty deciding how many resources their submitted jobs require on the cluster and do not feel proficient enough to do so. Consequently, users are encouraged to over-estimate resources for their submitted jobs so that the jobs will not be killed due to insufficient resources. This practice wastes HPC resources and leads to inefficient cluster utilization. We created a supervised machine learning model and integrated it into the Slurm resource manager simulator to predict the amount of memory a job requires and the amount of time required to run the computation. Our model uses several different machine learning algorithms. Our goal is to integrate and test the proposed supervised machine learning model on Slurm. We used over 10,000 tasks selected from our HPC log files to evaluate the performance and accuracy of our integrated model. The purpose of our work is to increase the performance of Slurm by predicting the memory and time each job requires, thereby improving the utilization of the HPC system. Our results indicate that, for larger jobs, our model helps dramatically reduce computational turnaround time (from five days to ten hours), substantially increases utilization of the HPC system, and decreases the average waiting time for submitted jobs.
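The idea of predicting a job's actual memory and runtime from what the user requested can be illustrated with a minimal sketch. The abstract does not name the algorithms or features used, so everything here is an assumption made for illustration: the k-nearest-neighbours regressor, the feature set (requested memory, requested time, cores), the `HISTORY` records, and the `predict` helper are all hypothetical and are not the authors' model or data.

```python
import math

# Hypothetical historical job records (invented for illustration, not from the
# paper's logs): (requested_mem_gb, requested_hours, cores) -> (used_mem_gb, used_hours)
HISTORY = [
    ((64.0, 48.0, 16), (8.0, 5.0)),
    ((32.0, 24.0, 8), (6.0, 3.0)),
    ((128.0, 120.0, 32), (20.0, 10.0)),
    ((16.0, 12.0, 4), (2.0, 1.0)),
    ((64.0, 72.0, 16), (10.0, 6.0)),
]


def distance(a, b):
    """Euclidean distance between two request feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(request, history=HISTORY, k=3):
    """Average the actual usage of the k past jobs most similar to `request`.

    Returns (predicted_mem_gb, predicted_hours).
    """
    nearest = sorted(history, key=lambda rec: distance(rec[0], request))[:k]
    mem = sum(usage[0] for _, usage in nearest) / len(nearest)
    hours = sum(usage[1] for _, usage in nearest) / len(nearest)
    return mem, hours


if __name__ == "__main__":
    # A user asks for 64 GB and 48 hours on 16 cores; similar past jobs
    # used far less, so the prediction is much smaller than the request.
    mem, hours = predict((64.0, 48.0, 16))
    print(f"predicted memory: {mem:.1f} GB, predicted runtime: {hours:.1f} h")
```

In a real deployment the features would likely need scaling, and a safety margin would probably be added to the prediction before passing it to the scheduler, since an under-estimate would cause the job to be killed.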