Tian Pan, Nianbing Yu, Chenhao Jia, Jianwen Pi, Liang Xu, Yisong Qiao, Zhiguo Li, Kun Liu, Jie Lu, Jianyuan Lu, Enge Song, Jiao Zhang, Tao Huang, Shunmin Zhu
{"title":"Sailfish","authors":"Tian Pan, Nianbing Yu, Chenhao Jia, Jianwen Pi, Liang Xu, Yisong Qiao, Zhiguo Li, Kun Liu, Jie Lu, Jianyuan Lu, Enge Song, Jiao Zhang, Tao Huang, Shunmin Zhu","doi":"10.1145/3452296.3472889","DOIUrl":"https://doi.org/10.1145/3452296.3472889","url":null,"abstract":"The cloud gateway is essential in the public cloud as the central hub of cloud traffic. We show that horizontal scaling of software gateways, once sustainable for years, is no longer future-proof facing the massive scale and rapid growth of today's cloud. The root cause is the stagnant performance of the CPU core, which is prone to be overloaded by heavy hitters as traffic growth goes far beyond Moore's law. To address this, we propose Sailfish, a cloud-scale multi-tenant multi-service gateway accelerated by programmable switches. The new challenge is that large forwarding tables due to multi-tenancy cannot be fit into the limited on-chip memories. To this end, we devise a multi-pronged approach with (1) hardware/software co-design for table sharing, (2) horizontal table splitting among gateway clusters, (3) pipeline-aware table compression for a single node. Compared with the x86 gateway of a similar price, Sailfish reduces latency by 95% (2μs), improves throughput by more than 20x in bps (3.2Tbps) and 71x in pps (1.8Gpps) with packet length < 256B. Sailfish has been deployed in Alibaba Cloud for more than two years. 
It is the first P4-based cloud gateway in the industry, of which a single cluster carries dozens of Tbps traffic, withstanding peak-hour traffic in large online shopping festivals.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81637277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Congestion detection in lossless networks","authors":"Yiran Zhang, Yifan Liu, Qingkai Meng, Fengyuan Ren","doi":"10.1145/3452296.3472899","DOIUrl":"https://doi.org/10.1145/3452296.3472899","url":null,"abstract":"Congestion detection is the cornerstone of end-to-end congestion control. Through in-depth observations and understandings, we reveal that existing congestion detection mechanisms in mainstream lossless networks (i.e., Converged Enhanced Ethernet and InfiniBand) are improper, due to failing to cognize the interaction between hop-by-hop flow controls and congestion detection behaviors in switches. We define ternary states of switch ports and present Ternary Congestion Detection (TCD) for mainstream lossless networks. Testbed and extensive simulations demonstrate that TCD can detect congestion ports accurately and identify flows contributing to congestion as well as flows only affected by hop-by-hop flow controls. Meanwhile, we shed light on how to incorporate TCD with rate control. Case studies show that existing congestion control algorithms can achieve 3.3x and 2.0x better median and 99th-percentile FCT slowdown by combining with TCD.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85491846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"mmTag","authors":"M. Mazaheri, Alex K Chen, Omid Abari","doi":"10.1145/3452296.3472917","DOIUrl":"https://doi.org/10.1145/3452296.3472917","url":null,"abstract":"Recent advances in IoT, machine learning and cloud computing have placed a huge strain on wireless networks. In particular, many emerging applications require streaming rich content (such as videos) in real time, while they are constrained by energy sources. A wireless network which supports high data-rate while consuming low-power would be very attractive for these applications. Unfortunately, existing wireless networks do not satisfy this requirement. For example, WiFi backscatter and Bluetooth networks have very low power consumption, but their data-rate is very limited (less than a Mbps). On the other hand, modern WiFi and mmWave networks support high throughput, but have a high power consumption (more than a watt). To address this problem, we present mmTag, a novel mmWave backscatter network which enables low-power high-throughput wireless links for emerging applications. mmTag is a backscatter system which operates in the mmWave frequency bands. mmTag addresses the key challenges that prevent existing backscatter networks from operating at mmWave bands. We implemented mmTag and evaluated its performance empirically. 
Our results show that mmTag is capable of achieving 1 Gbps and 100 Mbps at 4.6 m and 8 m, respectively, while consuming only 2.4 nJ/bit.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85730133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hang Zhu, Varun Gupta, S. Ahuja, Yuandong Tian, Ying Zhang, Xin Jin
{"title":"Network planning with deep reinforcement learning","authors":"Hang Zhu, Varun Gupta, S. Ahuja, Yuandong Tian, Ying Zhang, Xin Jin","doi":"10.1145/3452296.3472902","DOIUrl":"https://doi.org/10.1145/3452296.3472902","url":null,"abstract":"Network planning is critical to the performance, reliability and cost of web services. This problem is typically formulated as an Integer Linear Programming (ILP) problem. Today's practice relies on hand-tuned heuristics from human experts to address the scalability challenge of ILP solvers. In this paper, we propose NeuroPlan, a deep reinforcement learning (RL) approach to solve the network planning problem. This problem involves multi-step decision making and cost minimization, which can be naturally cast as a deep RL problem. We develop two important domain-specific techniques. First, we use a graph neural network (GNN) and a novel domain-specific node-link transformation for state encoding, in order to handle the dynamic nature of the evolving network topology during planning decision making. Second, we leverage a two-stage hybrid approach that first uses deep RL to prune the search space and then uses an ILP solver to find the optimal solution. This approach resembles today's practice, but avoids human experts with an RL agent in the first stage. 
Evaluation on real topologies and setups from large production networks demonstrates that NeuroPlan scales to large topologies beyond the capability of ILP solvers, and reduces the cost by up to 17% compared to hand-tuned heuristics.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86117215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Mahimkar, A. Sivakumar, Zihui Ge, Shomik Pathak, Karunasish Biswas
{"title":"Auric","authors":"A. Mahimkar, A. Sivakumar, Zihui Ge, Shomik Pathak, Karunasish Biswas","doi":"10.1145/3452296.3472906","DOIUrl":"https://doi.org/10.1145/3452296.3472906","url":null,"abstract":"Cellular service providers add carriers in the network in order to support the increasing demand in voice and data traffic and provide good quality of service to the users. Addition of new carriers requires the network operators to accurately configure their parameters for the desired behaviors. This is a challenging problem because of the large number of parameters related to various functions like user mobility, interference management and load balancing. Furthermore, the same parameters can have varying values across different locations to manage user and traffic behaviors as planned and respond appropriately to different signal propagation patterns and interference. Manual configuration is time-consuming, tedious and error-prone, which could result in poor quality of service. In this paper, we propose a new data-driven recommendation approach Auric to automatically and accurately generate configuration parameters for new carriers added in cellular networks. Our approach incorporates new algorithms based on collaborative filtering and geographical proximity to automatically determine similarity across existing carriers. We conduct a thorough evaluation using real-world LTE network data and observe a high accuracy (96%) across a large number of carriers and configuration parameters. 
We also share experiences from our deployment and use of Auric in production environments.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"2014 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86704496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mehrdad Khani Shirkoohi, M. Ghobadi, M. Alizadeh, Ziyi Zhu, M. Glick, K. Bergman, A. Vahdat, Benjamin Klenk, Eiman Ebrahimi
{"title":"SiP-ML: high-bandwidth optical network interconnects for machine learning training","authors":"Mehrdad Khani Shirkoohi, M. Ghobadi, M. Alizadeh, Ziyi Zhu, M. Glick, K. Bergman, A. Vahdat, Benjamin Klenk, Eiman Ebrahimi","doi":"10.1145/3452296.3472900","DOIUrl":"https://doi.org/10.1145/3452296.3472900","url":null,"abstract":"This paper proposes optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML, accelerates the training time of popular DNN models using silicon photonics links capable of providing multiple terabits-per-second of bandwidth per GPU. SiP-ML partitions the training job across GPUs with hybrid data and model parallelism while ensuring the communication pattern can be supported efficiently on the network interconnect. We develop task partitioning and device placement methods that take the degree and reconfiguration latency of optical interconnects into account. Simulations using real DNN models show that, compared to the state-of-the-art electrical networks, our approach improves training time by 1.3-9.1x.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"218 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75619788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qizhen Zhang, K. K. W. Ng, Charles W. Kazer, Shen Yan, João Sedoc, Vincent Liu
{"title":"MimicNet","authors":"Qizhen Zhang, K. K. W. Ng, Charles W. Kazer, Shen Yan, João Sedoc, Vincent Liu","doi":"10.1145/3452296.3472926","DOIUrl":"https://doi.org/10.1145/3452296.3472926","url":null,"abstract":"At-scale evaluation of new data center network innovations is becoming increasingly intractable. This is true for testbeds, where few, if any, can afford a dedicated, full-scale replica of a data center. It is also true for simulations, which while originally designed for precisely this purpose, have struggled to cope with the size of today's networks. This paper presents an approach for quickly obtaining accurate performance estimates for large data center networks. Our system, MimicNet, provides users with the familiar abstraction of a packet-level simulation for a portion of the network while leveraging redundancy and recent advances in machine learning to quickly and accurately approximate portions of the network that are not directly visible. MimicNet can provide over two orders of magnitude speedup compared to regular simulation for a data center with thousands of servers. Even at this scale, MimicNet estimates of the tail FCT, throughput, and RTT are within 5% of the true results.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"141 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80296697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marwan M. Fayed, Lorenz Bauer, V. Giotsas, Sami Kerola, Marek Majkowski, Pavel Odintsov, Jakub Sitnicki, Taejoong Chung, Dave Levin, A. Mislove, Christopher A. Wood, N. Sullivan
{"title":"The ties that un-bind: decoupling IP from web services and sockets for robust addressing agility at CDN-scale","authors":"Marwan M. Fayed, Lorenz Bauer, V. Giotsas, Sami Kerola, Marek Majkowski, Pavel Odintsov, Jakub Sitnicki, Taejoong Chung, Dave Levin, A. Mislove, Christopher A. Wood, N. Sullivan","doi":"10.1145/3452296.3472922","DOIUrl":"https://doi.org/10.1145/3452296.3472922","url":null,"abstract":"The couplings between IP addresses, names of content or services, and socket interfaces, are too tight. This impedes system manageability, growth, and overall provisioning. In turn, large-scale content providers are forced to use staggering numbers of addresses, ultimately leading to address exhaustion (IPv4) and inefficiency (IPv6). In this paper, we revisit IP bindings, entirely. We attempt to evolve addressing conventions by decoupling IP in DNS and from network sockets. Alongside technologies such as SNI and ECMP, a new architecture emerges that \"unbinds\" IP from services and servers, thereby returning IP's role to merely that of reachability. The architecture is under evaluation at a major CDN in multiple datacenters. We show that addresses can be generated randomly per-query, for 20M+ domains and services, from as few as ~4K addresses, 256 addresses, and even one IP address. We explain why this approach is transparent to routing, L4/L7 load-balancers, distributed caching, and all surrounding systems, and is highly desirable. 
Our experience suggests that many network-oriented systems and services (e.g., route leak mitigation, denial of service, measurement) could be improved, and new ones designed, if built with addressing agility.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81897879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhizhen Zhong, M. Ghobadi, Alaa Khaddaj, J. Leach, Yiting Xia, Ying Zhang
{"title":"ARROW","authors":"Zhizhen Zhong, M. Ghobadi, Alaa Khaddaj, J. Leach, Yiting Xia, Ying Zhang","doi":"10.1145/1963405.1963435","DOIUrl":"https://doi.org/10.1145/1963405.1963435","url":null,"abstract":"A drive-by download attack occurs when a user visits a webpage which attempts to automatically download malware without the user's consent. Attackers sometimes use a malware distribution network (MDN) to manage a large number of malicious webpages, exploits, and malware executables. In this paper, we provide a new method to determine these MDNs from the secondary URLs and redirect chains recorded by a high-interaction client honeypot. In addition, we propose a novel drive-by download detection method. Instead of depending on the malicious content used by previous methods, our algorithm first identifies and then leverages the URLs of the MDN's central servers, where a central server is a common server shared by a large percentage of the drive-by download attacks in the same MDN. A set of regular expression-based signatures are then generated based on the URLs of each central server. This method allows additional malicious webpages to be identified which launched but failed to execute a successful drive-by download attack. The new drive-by detection system named ARROW has been implemented, and we provide a large-scale evaluation on the output of a production drive-by detection system. 
The experimental results demonstrate the effectiveness of our method, where the detection coverage has been boosted by 96% with an extremely low false positive rate.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73723296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alan Tang, Siva Kesava Reddy Kakarla, Ryan Beckett, Ennan Zhai, Matt Brown, T. Millstein, Yuval Tamir, George Varghese
{"title":"Campion","authors":"Alan Tang, Siva Kesava Reddy Kakarla, Ryan Beckett, Ennan Zhai, Matt Brown, T. Millstein, Yuval Tamir, George Varghese","doi":"10.1145/3452296.3472925","DOIUrl":"https://doi.org/10.1145/3452296.3472925","url":null,"abstract":"We present a new approach for debugging two router configurations that are intended to be behaviorally equivalent. Existing router verification techniques cannot identify all differences or localize those differences to relevant configuration lines. Our approach addresses these limitations through a modular analysis, which separately analyzes pairs of corresponding configuration components. It handles all router components that affect routing and forwarding, including configuration for BGP, OSPF, static routes, route maps and ACLs. Further, for many configuration components our modular approach enables simple structural equivalence checks to be used without additional loss of precision versus modular semantic checks, aiding both efficiency and error localization. We implemented this approach in the tool Campion and applied it to debugging pairs of backup routers from different manufacturers and validating replacement of critical routers. Campion analyzed 30 proposed router replacements in a production cloud network and proactively detected four configuration bugs, including a route reflector bug that could have caused a severe outage. Campion also found multiple differences between backup routers from different vendors in a university network. 
These were undetected for three years, and depended on subtle semantic differences that the operators said they were \"highly unlikely\" to detect by \"just eyeballing the configs.\"","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80234267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}