Cross-Layer Self-Similar Coflow Scheduling for Machine Learning Clusters

Guang Yang, Yong Jiang, Qing Li, Xuya Jia, Mingwei Xu
{"title":"Cross-Layer Self-Similar Coflow Scheduling for Machine Learning Clusters","authors":"Guang Yang, Yong Jiang, Qing Li, Xuya Jia, Mingwei Xu","doi":"10.1109/ICCCN.2018.8487329","DOIUrl":null,"url":null,"abstract":"In recent years, many companies have developed various distributed computation frameworks for processing machine learning (ML) jobs in clusters. Networking is a well-known bottleneck for ML systems and the cluster demands efficient scheduling for huge traffic (up to 1GB per flow) generated by ML jobs. Coflow has been proven an effective abstraction to schedule flows of such data-parallel applications. However, the implementation of coflow scheduling policy is constrained when coflow characteristics are unknown a prior, and when TCP congestion control misinterprets the congestion signal leading to low throughput. Fortunately, traffic patterns experienced by some ML jobs support to speculate the complete coflow characteristic with limited information. Hence this paper summarizes coflow from these ML jobs as self-similar coflow and proposes a decentralized self-similar coflow scheduler Cicada. Cicada assigns each coflow a probe flow to speculate its characteristics during the transportation and employs the Shortest Job First (SJF) to separate coflow into strict priority queues based on the speculation result. To achieve full bandwidth for throughput- sensitive ML jobs, and to guarantee the scheduling policy implementation, Cicada promotes the elastic transport-layer rate control that outperforms prior works. Large-scale simulations show that Cicada completes coflow 2.08x faster than the state-of-the-art schemes in the information-agnostic scenario.","PeriodicalId":399145,"journal":{"name":"2018 27th International Conference on Computer Communication and Networks (ICCCN)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 27th International Conference on Computer Communication and Networks (ICCCN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCCN.2018.8487329","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

In recent years, many companies have developed distributed computation frameworks for processing machine learning (ML) jobs in clusters. Networking is a well-known bottleneck for ML systems, and such clusters demand efficient scheduling of the huge traffic (up to 1 GB per flow) generated by ML jobs. Coflow has proven to be an effective abstraction for scheduling the flows of such data-parallel applications. However, implementing a coflow scheduling policy is constrained when coflow characteristics are unknown a priori, and when TCP congestion control misinterprets congestion signals, leading to low throughput. Fortunately, the traffic patterns of some ML jobs make it possible to speculate the complete coflow characteristics from limited information. This paper therefore characterizes the coflows of these ML jobs as self-similar coflows and proposes Cicada, a decentralized self-similar coflow scheduler. Cicada assigns each coflow a probe flow to speculate its characteristics during transmission and employs Shortest Job First (SJF) to separate coflows into strict priority queues based on the speculation results. To achieve full bandwidth for throughput-sensitive ML jobs and to guarantee enforcement of the scheduling policy, Cicada introduces an elastic transport-layer rate control that outperforms prior work. Large-scale simulations show that Cicada completes coflows 2.08x faster than state-of-the-art schemes in the information-agnostic scenario.
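The scheduling step described in the abstract (speculate a coflow's size from its probe flow, then place it into strict priority queues in SJF fashion) can be illustrated with a short sketch. The code below is an assumption-laden illustration rather than Cicada's actual implementation: the queue count, the byte thresholds, and the names `Coflow`, `assign_queue`, and `schedule` are all hypothetical.

```python
# Minimal sketch of SJF-style priority-queue assignment from a speculated
# coflow size. NOT Cicada's implementation: queue thresholds and all names
# here are illustrative assumptions; the paper does not specify them.

from dataclasses import dataclass
from typing import List

# Hypothetical, exponentially spaced queue thresholds in bytes
# (10 MB, 100 MB, 1 GB); coflows above the last threshold get lowest priority.
QUEUE_THRESHOLDS = [10 * 2**20, 100 * 2**20, 1 * 2**30]

@dataclass
class Coflow:
    coflow_id: str
    bytes_estimated: int = 0   # size speculated from the probe flow
    queue: int = 0             # 0 = highest priority

def assign_queue(coflow: Coflow) -> int:
    """Map a coflow's estimated size to a strict priority queue index."""
    for q, threshold in enumerate(QUEUE_THRESHOLDS):
        if coflow.bytes_estimated <= threshold:
            return q
    return len(QUEUE_THRESHOLDS)  # largest coflows go to the lowest priority

def schedule(coflows: List[Coflow]) -> List[Coflow]:
    """Order coflows by queue first, then by estimated size (SJF within a queue)."""
    for c in coflows:
        c.queue = assign_queue(c)
    return sorted(coflows, key=lambda c: (c.queue, c.bytes_estimated))

if __name__ == "__main__":
    demo = [
        Coflow("shuffle-A", bytes_estimated=5 * 2**20),     # 5 MB
        Coflow("gradient-B", bytes_estimated=800 * 2**20),  # 800 MB
        Coflow("broadcast-C", bytes_estimated=50 * 2**20),  # 50 MB
    ]
    for c in schedule(demo):
        print(c.coflow_id, "-> queue", c.queue)
```

Running the demo places the 5 MB coflow in the highest-priority queue and the 800 MB coflow in the lowest, which is the SJF ordering the abstract describes; how probe flows produce the size estimates and how the elastic transport-layer rate control enforces the queue priorities are detailed in the paper itself.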