Woo-Yeon Lee, Yunseong Lee, Won Wook Song, Youngseok Yang, Jooyeon Kim, Byung-Gon Chun
2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), July 2021. DOI: 10.1109/ICDCS51616.2021.00085
Harmony: A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs
We introduce Harmony, a new scheduling framework that executes multiple Parameter-Server ML training jobs together to improve cluster resource utilization. Harmony coordinates a fine-grained execution of co-located jobs with complementary resource usages to avoid contention and to efficiently share resources between the jobs. To resolve the memory pressure due to the increased number of simultaneous jobs, Harmony uses a data spill/reload mechanism optimized for multiple jobs with the iterative execution pattern. Our evaluation shows that Harmony improves cluster resource utilization by up to 1.65×, resulting in a reduction of the mean ML training job time by about 53%, and makespan, the total time to process all given jobs, by about 38%, compared to the traditional approaches that allocate dedicated resources to each job.
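The abstract's core idea is co-locating jobs whose resource usages are complementary so they contend less. The following is a minimal illustrative sketch of that idea, not Harmony's actual scheduling algorithm: it greedily pairs the most CPU-dominant remaining job with the most network-dominant one. The job names and demand numbers are hypothetical.

```python
def pair_complementary(jobs):
    """jobs: dict mapping job name -> (cpu_demand, net_demand), each in [0, 1].

    Returns a list of (job_a, job_b) pairs, matching the most CPU-heavy
    remaining job with the most network-heavy remaining job, so the two
    co-located jobs stress different resources.
    """
    # Sort by CPU dominance (cpu - net): network-heavy jobs first.
    order = sorted(jobs, key=lambda name: jobs[name][0] - jobs[name][1])
    pairs = []
    while len(order) >= 2:
        net_heavy = order.pop(0)   # most network-dominant remaining job
        cpu_heavy = order.pop(-1)  # most CPU-dominant remaining job
        pairs.append((cpu_heavy, net_heavy))
    return pairs


# Hypothetical resource profiles for four training jobs.
profiles = {
    "resnet": (0.9, 0.2),  # compute-bound
    "lstm":   (0.3, 0.8),  # communication-heavy
    "vgg":    (0.8, 0.4),
    "gnmt":   (0.2, 0.9),
}
print(pair_complementary(profiles))
# → [('resnet', 'gnmt'), ('vgg', 'lstm')]
```

A real scheduler would also account for memory pressure (which Harmony handles with its spill/reload mechanism) and for the phase structure of iterative training, but the pairing intuition is the same.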