Timely-Throughput Optimal Coded Computing over Cloud Networks

Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing. Pub Date: 2019-04-11. DOI: 10.1145/3323679.3326528

Chien-Sheng Yang, Ramtin Pedarsani, A. Avestimehr


Citations: 31

Abstract

In modern distributed computing systems, unpredictable and unreliable infrastructures result in high variability of computing resources. Meanwhile, demand is growing significantly for timely, event-driven services with deadline constraints. Motivated by measurements over Amazon EC2 clusters, we consider a two-state Markov model for the variability of computing speed in cloud networks. In this model, each worker is in either a good state or a bad state in terms of computation speed, and the transitions between these states follow a Markov chain that is unknown to the scheduler. We then consider a Coded Computing framework, in which the data is possibly encoded and stored at the worker nodes in order to provide robustness against nodes that may be in a bad state. Given computation requests submitted to the system with computation deadlines, our goal is to design the optimal computation-load allocation scheme and the optimal data encoding scheme that maximize the timely computation throughput (i.e., the average number of computation tasks that are accomplished before their deadline). Our main result is the development of a dynamic computation strategy called the Lagrange Estimate-and-Allocate (LEA) strategy, which achieves the optimal timely computation throughput. Compared to the static allocation strategy, LEA improves the timely computation throughput by 1.4x to 17.5x in various simulated scenarios and by 1.27x to 6.5x in experiments over Amazon EC2 clusters.
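
As a rough illustration of the model described in the abstract, the following Python sketch simulates workers whose speed follows a two-state Markov chain, estimates the transition probabilities from the observed states, and measures a simplified timely throughput. It is not the LEA strategy, whose details are not given here. The function names, the parameters `p_gg` and `p_bb`, and the rule that a round meets its deadline when at least `k` workers are in the good state are all assumptions made for this example.

```python
import random


def simulate_states(p_gg, p_bb, rounds, rng):
    """Simulate one worker's state per round (True = good, False = bad).

    p_gg is the assumed probability of staying good, p_bb of staying bad.
    The worker is assumed to start in the good state.
    """
    state = True
    states = []
    for _ in range(rounds):
        states.append(state)
        stay = p_gg if state else p_bb
        # rng.random() is in [0, 1), so the state flips with probability 1 - stay.
        if rng.random() >= stay:
            state = not state
    return states


def estimate_transitions(states):
    """Empirical estimates of (P(good->good), P(bad->bad)) from one state sequence.

    Falls back to 0.5 when a state was never left, since nothing was observed.
    """
    good_seen = good_stay = bad_seen = bad_stay = 0
    for prev, nxt in zip(states, states[1:]):
        if prev:
            good_seen += 1
            good_stay += nxt
        else:
            bad_seen += 1
            bad_stay += (not nxt)
    p_gg = good_stay / good_seen if good_seen else 0.5
    p_bb = bad_stay / bad_seen if bad_seen else 0.5
    return p_gg, p_bb


def timely_throughput(all_states, k):
    """Fraction of rounds in which at least k workers are in the good state.

    Simplifying assumption: only good workers finish before the deadline,
    and any k finished results are enough to complete the round's task.
    """
    rounds = len(all_states[0])
    met = sum(
        1 for t in range(rounds)
        if sum(worker[t] for worker in all_states) >= k
    )
    return met / rounds


if __name__ == "__main__":
    rng = random.Random(0)
    workers = [simulate_states(0.9, 0.8, 5000, rng) for _ in range(10)]
    print("estimated (p_gg, p_bb) for worker 0:", estimate_transitions(workers[0]))
    print("timely throughput with k = 6:", timely_throughput(workers, 6))
```

In this toy setting, a scheduler that does not know the chain could use estimates such as those from `estimate_transitions` to decide how much load to give each worker, which is the general idea that an estimate-then-allocate approach suggests. The actual allocation and encoding rules are those of the paper, not of this sketch.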
