{"title":"Collie: Finding Performance Anomalies in RDMA Subsystems","authors":"Xinhao Kong, Yibo Zhu, Huaping Zhou, Zhuo Jiang, Jianxi Ye, Chuanxiong Guo, Danyang Zhuo","doi":"10.48550/arXiv.2304.11467","DOIUrl":"https://doi.org/10.48550/arXiv.2304.11467","url":null,"abstract":"High-speed RDMA networks are getting rapidly adopted in the industry for their low latency and reduced CPU overheads. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can potentially trigger abnormal performance behaviors (e.g., unexpected low throughput, PFC pause frame storm). We design and implement Collie, a tool for users to systematically uncover performance anomalies in RDMA subsystems without the need to access hardware internal designs. Instead of individually testing each hardware device (e.g., NIC, memory, PCIe), Collie is holistic, constructing a comprehensive search space for application workloads. Collie then uses simulated annealing to drive RDMA-related performance and diagnostic counters to extreme value regions to find workloads that can trigger performance anomalies. We evaluate Collie on combinations of various RDMA NIC, CPU, and other hardware components. Collie found 15 new performance anomalies. All of them are acknowledged by the hardware vendors. 7 of them are already fixed after we reported them. We also present our experience in using Collie to avoid performance anomalies for an RDMA RPC library and an RDMA distributed machine learning framework.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"243 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122458512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skyplane: Optimizing Transfer Cost and Throughput Using Cloud-Aware Overlays","authors":"Paras Jain, Sam Kumar, Sarah Wooders, Shishir G. Patil, Joseph Gonzalez, I. Stoica","doi":"10.48550/arXiv.2210.07259","DOIUrl":"https://doi.org/10.48550/arXiv.2210.07259","url":null,"abstract":"Cloud applications are increasingly distributing data across multiple regions and cloud providers. Unfortunately, wide-area bulk data transfers are often slow, bottlenecking applications. We demonstrate that it is possible to significantly improve inter-region cloud bulk transfer throughput by adapting network overlays to the cloud setting -- that is, by routing data through indirect paths at the application layer. However, directly applying network overlays in this setting can result in unacceptable increases in cloud egress prices. We present Skyplane, a system for bulk data transfer between cloud object stores that uses cloud-aware network overlays to optimally navigate the trade-off between price and performance. Skyplane's planner uses mixed-integer linear programming to determine the optimal overlay path and resource allocation for data transfer, subject to user-provided constraints on price or performance. Skyplane outperforms public cloud transfer services by up to $4.6\\times$ for transfers within one cloud and by up to $5.0\\times$ across clouds.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120961148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Saiyan: Design and Implementation of a Low-power Demodulator for LoRa Backscatter Systems","authors":"Xiuzhen Guo, Longfei Shangguan, Yuan He, Nan Jing, Jiacheng Zhang, Haotian Jiang, Yunhao Liu","doi":"10.48550/arXiv.2209.15348","DOIUrl":"https://doi.org/10.48550/arXiv.2209.15348","url":null,"abstract":"The radio range of backscatter systems continues growing as new wireless communication primitives are continuously invented. Nevertheless, both the bit error rate and the packet loss rate of backscatter signals increase rapidly with the radio range, thereby necessitating the cooperation between the access point and the backscatter tags through a feedback loop. Unfortunately, the low-power nature of backscatter tags limits their ability to demodulate feedback signals from a remote access point and scales down to such circumstances. This paper presents Saiyan, an ultra-low-power demodulator for long-range LoRa backscatter systems. With Saiyan, a backscatter tag can demodulate feedback signals from a remote access point with moderate power consumption and then perform an immediate packet retransmission in the presence of packet loss. Moreover, Saiyan enables rate adaption and channel hopping, two PHY-layer operations that are important to channel efficiency yet unavailable on long-range backscatter systems. We prototype Saiyan on a two-layer PCB board and evaluate its performance in different environments. Results show that Saiyan achieves 5 gain on the demodulation range, compared with state-of-the-art systems. Our ASIC simulation shows that the power consumption of Saiyan is around 93.2 uW. Code and hardware schematics can be found at: https://github.com/ZangJac/Saiyan.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"08 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121280854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training","authors":"Jie You, Jaehoon Chung, Mosharaf Chowdhury","doi":"10.48550/arXiv.2208.06102","DOIUrl":"https://doi.org/10.48550/arXiv.2208.06102","url":null,"abstract":"Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose Zeus, an optimization framework to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%-75.8% for diverse workloads.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124444970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Tail Latency Estimation for Data Center Networks","authors":"Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas E. Anderson","doi":"10.48550/arXiv.2205.01234","DOIUrl":"https://doi.org/10.48550/arXiv.2205.01234","url":null,"abstract":"In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large scale data center networks. Network tail latency is often a crucial metric for cloud application performance that can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but are very hard to parallelize, taking hours or days to answer what if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at a cost of including a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this gap by developing a set of techniques to provide fast performance estimates for large scale networks with general traffic matrices and topologies. A key step is to decompose the problem into a large number of parallel independent single-link simulations; we carefully combine these link-level simulations to produce accurate estimates of end-to-end flow level performance distributions for the entire network. Like MimicNet, we exploit symmetry where possible to gain additional speedups, but without relying on machine learning, so there is no training delay. On large-scale networks where ns-3 takes 11 to 27 hours to simulate five seconds of network behavior, our techniques run in one to two minutes with 99th percentile accuracy within 9% for flow completion times.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116720196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dashlet: Taming Swipe Uncertainty for Robust Short Video Streaming","authors":"Zhuqi Li, Yaxiong Xie, R. Netravali, K. Jamieson","doi":"10.48550/arXiv.2204.12954","DOIUrl":"https://doi.org/10.48550/arXiv.2204.12954","url":null,"abstract":"Short video streaming applications have recently gained substantial traction, but the non-linear video presentation they afford swiping users fundamentally changes the problem of maximizing user quality of experience in the face of the vagaries of network throughput and user swipe timing. This paper describes the design and implementation of Dashlet, a system tailored for high quality of experience in short video streaming applications. With the insights we glean from an in-the-wild TikTok performance study and a user study focused on swipe patterns, Dashlet proposes a novel out-of-order video chunk pre-buffering mechanism that leverages a simple, non machine learning-based model of users' swipe statistics to determine the pre-buffering order and bitrate. The net result is a system that achieves 77-99% of an oracle system's QoE and outperforms TikTok by 43.9-45.1x, while also reducing by 30% the number of bytes wasted on downloaded video that is never watched.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"325 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122097790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs","authors":"John Thorpe, Pengzhan Zhao, Jon Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, R. Netravali, Guoqing Harry Xu","doi":"10.48550/arXiv.2204.12013","DOIUrl":"https://doi.org/10.48550/arXiv.2204.12013","url":null,"abstract":"DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales. This paper aims to significantly reduce training costs with effective use of preemptible instances, i.e., those that can be obtained at a much cheaper price while idle, but may be preempted whenever requested by priority users. Doing so, however, requires new forms of resiliency and efficiency to cope with the possibility of frequent preemptions - a failure model that is drastically different from the occasional failures in normal cluster settings that existing checkpointing techniques target. We present Bamboo, a distributed system that tackles these challenges by introducing redundant computations into the training pipeline, i.e., whereby one node performs computations over not only its own layers but also over some layers in its neighbor. Our key insight is that training large models often requires pipeline parallelism where \"pipeline bubbles\" naturally exist. Bamboo carefully fills redundant computations into these bubbles, providing resilience at a low cost. Across a variety of widely used DNN models, Bamboo outperforms traditional checkpointing by 3.7x in training throughput, and reduces costs by 2.4x compared to a setting where on-demand instances are used.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128147843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Canvas: Isolated and Adaptive Swapping for Multi-Applications on Remote Memory","authors":"Chenxi Wang, Yifan Qiao, Haoran Ma, Shiafun Liu, Yiying Zhang, Wenguang Chen, R. Netravali, Miryung Kim, Guoqing Harry Xu","doi":"10.48550/arXiv.2203.09615","DOIUrl":"https://doi.org/10.48550/arXiv.2203.09615","url":null,"abstract":"Remote memory techniques for datacenter applications have recently gained a great deal of popularity. Existing remote memory techniques focus on the efficiency of a single application setting only. However, when multiple applications co-run on a remote-memory system, significant interference could occur, resulting in unexpected slowdowns even if the same amounts of physical resources are granted to each application. This slowdown stems from massive sharing in applications' swap data paths. Canvas is a redesigned swap system that fully isolates swap paths for remote-memory applications. Canvas allows each application to possess its dedicated swap partition, swap cache, prefetcher, and RDMA bandwidth. Swap isolation lays a foundation for adaptive optimization techniques based on each application's own access patterns and needs. We develop three such techniques: (1) adaptive swap entry allocation, (2) semantics-aware prefetching, and (3) two-dimensional RDMA scheduling. A thorough evaluation with a set of widely-deployed applications demonstrates that Canvas minimizes performance variation and dramatically reduces performance degradation.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125054163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stardust: Divide and Conquer in the Data Center Network","authors":"Noa Zilberman, Gabriel Bracha, Golan Schzukin","doi":"10.17863/CAM.36895","DOIUrl":"https://doi.org/10.17863/CAM.36895","url":null,"abstract":"Building scalable data centers, and network devices that fit within these data centers, has become increasingly hard. With modern switches pushing at the boundary of manufacturing feasibility, being able to build suitable, and scalable network fabrics becomes of critical importance. We introduce Stardust, a fabric architecture for data center scale networks, inspired by network-switch systems. Stardust combines packet switches at the edge and disaggregated cell switches at the network fabric, using scheduled traffic. Stardust is a distributed solution that attends to the scale limitations of network-switch design, while also offering improved performance and power savings compared with traditional solutions. With ever-increasing networking requirements, Stardust predicts the elimination of packet switches, replaced by cell switches in the network, and smart network hardware at the hosts.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125877103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TimeCrypt: Encrypted Data Stream Processing at Scale with Cryptographic Access Control","authors":"Lukas Burkhalter, Anwar Hithnawi, Alexander Viand, Hossein Shafagh, S. Ratnasamy","doi":"10.3929/ETHZ-B-000402391","DOIUrl":"https://doi.org/10.3929/ETHZ-B-000402391","url":null,"abstract":"A growing number of devices and services collect detailed time series data that is stored in the cloud. Protecting the confidentiality of this vast and continuously generated data is an acute need for many applications in this space. At the same time, we must preserve the utility of this data by enabling authorized services to securely and selectively access and run analytics. This paper presents TimeCrypt, a system that provides scalable and real-time analytics over large volumes of encrypted time series data. TimeCrypt allows users to define expressive data access and privacy policies and enforces it cryptographically via encryption. In TimeCrypt, data is encrypted end-to-end, and authorized parties can only decrypt and verify queries within their authorized access scope. Our evaluation of TimeCrypt shows that its memory overhead and performance are competitive and close to operating on data in the clear.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121024012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}