Harmony: A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs

Woo-Yeon Lee, Yunseong Lee, Won Wook Song, Youngseok Yang, Jooyeon Kim, Byung-Gon Chun
{"title":"Harmony: A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs","authors":"Woo-Yeon Lee, Yunseong Lee, Won Wook Song, Youngseok Yang, Jooyeon Kim, Byung-Gon Chun","doi":"10.1109/ICDCS51616.2021.00085","DOIUrl":null,"url":null,"abstract":"We introduce Harmony, a new scheduling framework that executes multiple Parameter-Server ML training jobs together to improve cluster resource utilization. Harmony coordinates a fine-grained execution of co-located jobs with complementary resource usages to avoid contention and to efficiently share resources between the jobs. To resolve the memory pressure due to the increased number of simultaneous jobs, Harmony uses a data spill/reload mechanism optimized for multiple jobs with the iterative execution pattern. Our evaluation shows that Harmony improves cluster resource utilization by up to 1.65×, resulting in a reduction of the mean ML training job time by about 53%, and makespan, the total time to process all given jobs, by about 38%, compared to the traditional approaches that allocate dedicated resources to each job.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS51616.2021.00085","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

We introduce Harmony, a new scheduling framework that executes multiple Parameter-Server ML training jobs together to improve cluster resource utilization. Harmony coordinates fine-grained execution of co-located jobs with complementary resource usage to avoid contention and to share resources efficiently between jobs. To relieve the memory pressure caused by the increased number of simultaneous jobs, Harmony uses a data spill/reload mechanism optimized for multiple jobs with iterative execution patterns. Our evaluation shows that Harmony improves cluster resource utilization by up to 1.65×, reducing the mean ML training job time by about 53% and the makespan, the total time to process all given jobs, by about 38%, compared to traditional approaches that allocate dedicated resources to each job.
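To make the co-location idea in the abstract concrete, the following Python sketch pairs jobs whose resource usage profiles are complementary so that their combined demand stays within node capacity. This is not Harmony's actual scheduling algorithm; the resource dimensions, the Job class, and the greedy pairing heuristic are all assumptions made purely for illustration.

```python
"""A minimal, hypothetical sketch of complementary-resource co-location.
Not Harmony's algorithm: all names and heuristics here are illustrative."""

from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, List

# Assumed resource dimensions; a real scheduler would measure these per job.
RESOURCES = ("gpu_compute", "network_bw", "host_memory")


@dataclass
class Job:
    name: str
    # Fraction of each node resource the job uses when running alone (0.0 - 1.0).
    demand: Dict[str, float] = field(default_factory=dict)


def contention_score(a: Job, b: Job) -> float:
    """Higher when both jobs stress the same resources (less complementary)."""
    return sum(a.demand[r] * b.demand[r] for r in RESOURCES)


def fits_together(a: Job, b: Job, capacity: float = 1.0) -> bool:
    """True if the combined demand stays within node capacity on every resource."""
    return all(a.demand[r] + b.demand[r] <= capacity for r in RESOURCES)


def pair_jobs(jobs: List[Job]) -> List[List[Job]]:
    """Greedily co-locate the most complementary feasible pairs; leftovers run alone."""
    unplaced = {j.name for j in jobs}
    by_name = {j.name: j for j in jobs}
    candidates = sorted(
        (p for p in combinations(jobs, 2) if fits_together(*p)),
        key=lambda p: contention_score(*p),
    )
    placements: List[List[Job]] = []
    for a, b in candidates:
        if a.name in unplaced and b.name in unplaced:
            placements.append([a, b])
            unplaced -= {a.name, b.name}
    placements.extend([by_name[n]] for n in sorted(unplaced))
    return placements


if __name__ == "__main__":
    jobs = [
        Job("cnn",  {"gpu_compute": 0.8, "network_bw": 0.2, "host_memory": 0.4}),
        Job("lstm", {"gpu_compute": 0.3, "network_bw": 0.7, "host_memory": 0.3}),
        Job("gbdt", {"gpu_compute": 0.1, "network_bw": 0.4, "host_memory": 0.5}),
    ]
    for group in pair_jobs(jobs):
        print("co-located:", [j.name for j in group])
```

Running the sketch pairs the GPU-heavy "cnn" job with the lighter "gbdt" job and leaves the network-heavy "lstm" job on its own, since pairing it with either would exceed capacity on at least one resource; Harmony's actual coordination is finer-grained and also handles the memory pressure of the extra jobs via its spill/reload mechanism.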