RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation

JCR quartile: Q3 (Computer Science)

Operating Systems Review (ACM) Pub Date : 2021-10-26 DOI:10.1145/3477132.3483578

Andrew Newell, Dimitrios Skarlatos, Jingyuan Fan, Pavan Kumar, Maxim Khutornenko, Mayank Pundir, Yirui Zhang, Mingjun Zhang, Yuanlai Liu, Linh Le, Brendon Daugherty, Apurva Samudra, Prashasti Baid, James Kneeland, Igor Kabiljo, Dmitry Shchukin, André Rodrigues, S. Michelson, B. Christensen, K. Veeraraghavan, Chunqiang Tang


Citations: 10

Abstract

Capacity reservation is a common offering in public clouds and on-premise infrastructure. However, no prior work provides capacity reservation with SLO guarantees that takes into account random and correlated hardware failures, datacenter maintenance, and heterogeneous hardware. In this paper, we describe how Facebook's region-scale Resource Allowance System (RAS) addresses these issues and provides guaranteed capacity. RAS uses a capacity abstraction called reservation to represent a set of servers dynamically assigned to a logical cluster. We take a two-level approach to scale resource allocation to all datacenters in a region, where a mixed-integer-programming solver continuously optimizes server-to-reservation assignments off the critical path, and a traditional container allocator does real-time placement of containers on servers in a reservation. As a relatively new component of Facebook's 10-year old cluster manager Twine, RAS has been running in production for almost two years, continuously optimizing the allocation of millions of servers to thousands of reservations. We describe the design of RAS and share our experience of deploying it at scale.
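The two-level split described in the abstract can be illustrated with a toy sketch. This is not the RAS implementation: an exhaustive search stands in for the mixed-integer-programming solver, the rule that a reservation must keep its capacity after losing any single fault domain is an illustrative simplification of the failure handling the abstract mentions, and every name here (`SERVERS`, `RESERVATIONS`, `surviving_capacity`, `solve`, `place_container`) is hypothetical.

```python
from itertools import product

# Hypothetical toy inventory: server name -> (fault domain, capacity units).
SERVERS = {
    "s1": ("dc1", 1), "s2": ("dc1", 1),
    "s3": ("dc2", 1), "s4": ("dc2", 1),
    "s5": ("dc3", 1), "s6": ("dc3", 1),
}
# Hypothetical reservations: name -> capacity that must survive one domain loss.
RESERVATIONS = {"web": 2, "cache": 1}


def surviving_capacity(assigned):
    """Worst-case capacity left after losing any single fault domain."""
    total = sum(SERVERS[s][1] for s in assigned)
    per_domain = {}
    for s in assigned:
        domain, cap = SERVERS[s]
        per_domain[domain] = per_domain.get(domain, 0) + cap
    return total - max(per_domain.values(), default=0)


def solve():
    """First level (offline): assign servers to reservations.

    Brute force over all assignments, keeping the feasible one that uses
    the fewest servers. A real system would use a MIP solver instead.
    """
    names = list(SERVERS)
    choices = [None] + list(RESERVATIONS)  # None means "leave unassigned"
    best = None
    for combo in product(choices, repeat=len(names)):
        groups = {r: [] for r in RESERVATIONS}
        for server, r in zip(names, combo):
            if r is not None:
                groups[r].append(server)
        feasible = all(
            surviving_capacity(groups[r]) >= need
            for r, need in RESERVATIONS.items()
        )
        if feasible:
            used = sum(len(g) for g in groups.values())
            if best is None or used < best[0]:
                best = (used, groups)
    return best[1] if best else None


def place_container(plan, reservation, busy):
    """Second level (real time): first-fit within the reservation's servers."""
    for server in plan[reservation]:
        if server not in busy:
            busy.add(server)
            return server
    return None  # the reservation has no free server left


if __name__ == "__main__":
    plan = solve()
    print(plan)
    print(place_container(plan, "web", set()))
```

In this toy instance the search has to spread each reservation across fault domains, so the extra servers act as a failure buffer. The container-level function never looks outside the servers its reservation was given, which is what keeps the expensive optimization off the placement path.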



Source journal

Operating Systems Review (ACM) Computer Science-Computer Networks and Communications

CiteScore

2.80

Self-citation rate

0.00%

Articles published

Journal description: Operating Systems Review (OSR) is a publication of the ACM Special Interest Group on Operating Systems (SIGOPS), whose scope of interest includes: computer operating systems and architecture for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and simulation; reliability, integrity, and security of data; communications among computing processors; and computer system modeling and analysis.