Primus: Fast and Robust Centralized Routing for Large-scale Data Center Networks

IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. Pub Date: 2021-05-10. DOI: 10.1109/INFOCOM42981.2021.9488689

Guihua Zhou, Guo Chen, Fusheng Lin, Tingting Xu, D. Wei, Jianbing Wu, Li Chen, Yuanwei Lu, Andrew Qu, Hua Shao, Hongbo Jiang


Citations: 2

Abstract



This paper presents Primus, a fast and robust centralized routing solution for data center networks (DCNs). For fast routing calculation, Primus uses a centralized controller to collect and disseminate the network's link states (LS), and offloads the actual routing calculation onto each switch. Observing that routing changes in DCNs, which have regular topologies, can be classified into a few fixed patterns, we simplify each switch's routing calculation into a table lookup: the switch compares LS changes with a pre-installed base topology and updates its routing paths according to predefined rules. As a result, routing calculation at each switch takes only tens of microseconds, even in a large network topology containing 10K+ switches. For efficient controller fault tolerance, Primus purposely uses reporter switches to ensure that LS updates are successfully delivered to all affected switches. Primus can therefore tolerate failures using multiple stateless controllers and little redundant traffic, which incurs little overhead in the normal case and keeps the routing reaction time at tens of milliseconds even under complex data-plane and control-plane failures. We design, implement, and evaluate Primus with extensive experiments on Linux-machine controllers and white-box switches. Primus provides ~1200x and ~100x shorter convergence time than the current distributed protocol BGP and the state-of-the-art centralized routing solution, respectively.
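The table-lookup idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the actual rules, so everything below is an assumption made for illustration: a two-tier leaf-spine fabric in which every leaf connects to every spine, a hypothetical `LeafRouter` class, and two hand-written update rules. It is not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LeafRouter:
    """Hypothetical leaf switch in a two-tier leaf-spine fabric (illustrative only)."""

    name: str
    spines: list[str]
    leaves: list[str]
    # Links currently known to be down, stored as (spine, leaf) pairs.
    down_links: set[tuple[str, str]] = field(default_factory=set)
    # Destination leaf -> set of spines usable as next hops.
    next_hops: dict[str, set[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumed base topology: every leaf connects to every spine, so every
        # spine starts out as a valid next hop toward every other leaf.
        for dst in self.leaves:
            if dst != self.name:
                self.next_hops[dst] = set(self.spines)

    def _usable(self, spine: str, dst: str) -> bool:
        # A spine is usable toward dst only if both hops of the path are up.
        return (spine, self.name) not in self.down_links and (
            spine,
            dst,
        ) not in self.down_links

    def on_link_state(self, spine: str, leaf: str, up: bool) -> None:
        """Apply one link-state change for the (spine, leaf) link."""
        link = (spine, leaf)
        if up:
            self.down_links.discard(link)
        else:
            self.down_links.add(link)

        # Rule 1 (assumed): a change on our own uplink affects every destination.
        # Rule 2 (assumed): a change on a remote link affects only that leaf.
        affected = list(self.next_hops) if leaf == self.name else [leaf]
        for dst in affected:
            hops = self.next_hops.get(dst)
            if hops is None:
                continue  # Unknown destination: nothing to update.
            if self._usable(spine, dst):
                hops.add(spine)
            else:
                hops.discard(spine)


router = LeafRouter("L1", spines=["S1", "S2"], leaves=["L1", "L2", "L3"])
router.on_link_state("S1", "L2", up=False)  # remote link failure
print(router.next_hops)
```

In this sketch each update touches only the entries that the changed link can influence, rather than recomputing shortest paths over the whole graph, which is the general flavor of a rule-based update against a known base topology.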


Source journal

IEEE INFOCOM 2021 - IEEE Conference on Computer Communications

Self-citation rate

0.00%

Number of publications