Cross-Layer Self-Similar Coflow Scheduling for Machine Learning Clusters

Guang Yang, Yong Jiang, Qing Li, Xuya Jia, Mingwei Xu
{"title":"Cross-Layer Self-Similar Coflow Scheduling for Machine Learning Clusters","authors":"Guang Yang, Yong Jiang, Qing Li, Xuya Jia, Mingwei Xu","doi":"10.1109/ICCCN.2018.8487329","DOIUrl":null,"url":null,"abstract":"In recent years, many companies have developed various distributed computation frameworks for processing machine learning (ML) jobs in clusters. Networking is a well-known bottleneck for ML systems and the cluster demands efficient scheduling for huge traffic (up to 1GB per flow) generated by ML jobs. Coflow has been proven an effective abstraction to schedule flows of such data-parallel applications. However, the implementation of coflow scheduling policy is constrained when coflow characteristics are unknown a prior, and when TCP congestion control misinterprets the congestion signal leading to low throughput. Fortunately, traffic patterns experienced by some ML jobs support to speculate the complete coflow characteristic with limited information. Hence this paper summarizes coflow from these ML jobs as self-similar coflow and proposes a decentralized self-similar coflow scheduler Cicada. Cicada assigns each coflow a probe flow to speculate its characteristics during the transportation and employs the Shortest Job First (SJF) to separate coflow into strict priority queues based on the speculation result. To achieve full bandwidth for throughput- sensitive ML jobs, and to guarantee the scheduling policy implementation, Cicada promotes the elastic transport-layer rate control that outperforms prior works. Large-scale simulations show that Cicada completes coflow 2.08x faster than the state-of-the-art schemes in the information-agnostic scenario.","PeriodicalId":399145,"journal":{"name":"2018 27th International Conference on Computer Communication and Networks (ICCCN)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 27th International Conference on Computer Communication and Networks (ICCCN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCCN.2018.8487329","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

In recent years, many companies have developed distributed computation frameworks for processing machine learning (ML) jobs in clusters. Networking is a well-known bottleneck for ML systems, and such clusters demand efficient scheduling of the huge traffic (up to 1 GB per flow) generated by ML jobs. Coflow has proven to be an effective abstraction for scheduling the flows of such data-parallel applications. However, implementing a coflow scheduling policy is constrained when coflow characteristics are unknown a priori, and when TCP congestion control misinterprets congestion signals, leading to low throughput. Fortunately, the traffic patterns of some ML jobs make it possible to speculate the complete coflow characteristics from limited information. This paper therefore characterizes the coflows of these ML jobs as self-similar coflows and proposes Cicada, a decentralized self-similar coflow scheduler. Cicada assigns each coflow a probe flow to speculate its characteristics during transmission and employs Shortest Job First (SJF) to separate coflows into strict priority queues based on the speculation results. To achieve full bandwidth for throughput-sensitive ML jobs and to guarantee enforcement of the scheduling policy, Cicada introduces an elastic transport-layer rate control that outperforms prior work. Large-scale simulations show that Cicada completes coflows 2.08x faster than state-of-the-art schemes in the information-agnostic scenario.
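The scheduling step described in the abstract (speculate a coflow's size from its probe flow, then place it into strict priority queues in SJF fashion) can be illustrated with a short sketch. The code below is an assumption-laden illustration rather than Cicada's actual implementation: the queue count, the byte thresholds, and the names `Coflow`, `assign_queue`, and `schedule` are all hypothetical.

```python
# Minimal sketch of SJF-style priority-queue assignment from a speculated
# coflow size. NOT Cicada's implementation: queue thresholds and all names
# here are illustrative assumptions; the paper does not specify them.

from dataclasses import dataclass
from typing import List

# Hypothetical, exponentially spaced queue thresholds in bytes
# (10 MB, 100 MB, 1 GB); coflows above the last threshold get lowest priority.
QUEUE_THRESHOLDS = [10 * 2**20, 100 * 2**20, 1 * 2**30]

@dataclass
class Coflow:
    coflow_id: str
    bytes_estimated: int = 0   # size speculated from the probe flow
    queue: int = 0             # 0 = highest priority

def assign_queue(coflow: Coflow) -> int:
    """Map a coflow's estimated size to a strict priority queue index."""
    for q, threshold in enumerate(QUEUE_THRESHOLDS):
        if coflow.bytes_estimated <= threshold:
            return q
    return len(QUEUE_THRESHOLDS)  # largest coflows go to the lowest priority

def schedule(coflows: List[Coflow]) -> List[Coflow]:
    """Order coflows by queue first, then by estimated size (SJF within a queue)."""
    for c in coflows:
        c.queue = assign_queue(c)
    return sorted(coflows, key=lambda c: (c.queue, c.bytes_estimated))

if __name__ == "__main__":
    demo = [
        Coflow("shuffle-A", bytes_estimated=5 * 2**20),     # 5 MB
        Coflow("gradient-B", bytes_estimated=800 * 2**20),  # 800 MB
        Coflow("broadcast-C", bytes_estimated=50 * 2**20),  # 50 MB
    ]
    for c in schedule(demo):
        print(c.coflow_id, "-> queue", c.queue)
```

Running the demo places the 5 MB coflow in the highest-priority queue and the 800 MB coflow in the lowest, which is the SJF ordering the abstract describes; how probe flows produce the size estimates and how the elastic transport-layer rate control enforces the queue priorities are detailed in the paper itself.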