Symposium on Networked Systems Design and Implementation: Latest Papers

Collie: Finding Performance Anomalies in RDMA Subsystems

Symposium on Networked Systems Design and Implementation Pub Date : 2023-04-22 DOI: 10.48550/arXiv.2304.11467

Xinhao Kong, Yibo Zhu, Huaping Zhou, Zhuo Jiang, Jianxi Ye, Chuanxiong Guo, Danyang Zhuo

Abstract: High-speed RDMA networks are getting rapidly adopted in the industry for their low latency and reduced CPU overheads. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can potentially trigger abnormal performance behaviors (e.g., unexpected low throughput, PFC pause frame storm). We design and implement Collie, a tool for users to systematically uncover performance anomalies in RDMA subsystems without the need to access hardware internal designs. Instead of individually testing each hardware device (e.g., NIC, memory, PCIe), Collie is holistic, constructing a comprehensive search space for application workloads. Collie then uses simulated annealing to drive RDMA-related performance and diagnostic counters to extreme value regions to find workloads that can trigger performance anomalies. We evaluate Collie on combinations of various RDMA NIC, CPU, and other hardware components. Collie found 15 new performance anomalies. All of them are acknowledged by the hardware vendors. 7 of them are already fixed after we reported them. We also present our experience in using Collie to avoid performance anomalies for an RDMA RPC library and an RDMA distributed machine learning framework.
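The Collie abstract above mentions using simulated annealing to push counters toward extreme values. The following is a minimal, generic sketch of simulated annealing only; it is not Collie's implementation. The workload knobs, their bounds, and the `toy_counter` function are hypothetical stand-ins invented for illustration, and no real RDMA counters are read.

```python
import math
import random


def simulated_annealing(score, initial, neighbor, steps=2000, t0=1.0,
                        cooling=0.995, seed=0):
    """Maximize score(x) by simulated annealing; returns (best_x, best_score)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept regressions with
        # probability exp(delta / T), which shrinks as T cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling
    return best, best_score


# Hypothetical workload knobs (assumptions, not from the paper):
# (log2 of message size in bytes, number of queue pairs).
BOUNDS = ((6, 22), (1, 64))


def toy_counter(workload):
    """Synthetic stand-in for a diagnostic counter; NOT a real measurement."""
    log_size, qps = workload
    return -((log_size - 9) ** 2) - 0.05 * (qps - 48) ** 2


def neighbor(workload, rng):
    """Perturb one randomly chosen knob by +/-1, clamped to its bounds."""
    knobs = list(workload)
    i = rng.randrange(len(knobs))
    lo, hi = BOUNDS[i]
    knobs[i] = min(hi, max(lo, knobs[i] + rng.choice((-1, 1))))
    return tuple(knobs)


best, best_score = simulated_annealing(toy_counter, (16, 4), neighbor)
print(best, best_score)
```

In a real setting, `score` would presumably run a workload and read a counter, which makes each evaluation expensive; that cost is one reason a guided search of this kind can be preferable to exhaustively enumerating the workload space.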

Citations: 21

Skyplane: Optimizing Transfer Cost and Throughput Using Cloud-Aware Overlays

Symposium on Networked Systems Design and Implementation Pub Date : 2022-10-13 DOI: 10.48550/arXiv.2210.07259

Paras Jain, Sam Kumar, Sarah Wooders, Shishir G. Patil, Joseph Gonzalez, I. Stoica

Citations: 9

Saiyan: Design and Implementation of a Low-power Demodulator for LoRa Backscatter Systems

Symposium on Networked Systems Design and Implementation Pub Date : 2022-09-30 DOI: 10.48550/arXiv.2209.15348

Xiuzhen Guo, Longfei Shangguan, Yuan He, Nan Jing, Jiacheng Zhang, Haotian Jiang, Yunhao Liu

Abstract: The radio range of backscatter systems continues growing as new wireless communication primitives are continuously invented. Nevertheless, both the bit error rate and the packet loss rate of backscatter signals increase rapidly with the radio range, thereby necessitating the cooperation between the access point and the backscatter tags through a feedback loop. Unfortunately, the low-power nature of backscatter tags limits their ability to demodulate feedback signals from a remote access point and scales down to such circumstances. This paper presents Saiyan, an ultra-low-power demodulator for long-range LoRa backscatter systems. With Saiyan, a backscatter tag can demodulate feedback signals from a remote access point with moderate power consumption and then perform an immediate packet retransmission in the presence of packet loss. Moreover, Saiyan enables rate adaption and channel hopping, two PHY-layer operations that are important to channel efficiency yet unavailable on long-range backscatter systems. We prototype Saiyan on a two-layer PCB board and evaluate its performance in different environments. Results show that Saiyan achieves 5 gain on the demodulation range, compared with state-of-the-art systems. Our ASIC simulation shows that the power consumption of Saiyan is around 93.2 uW. Code and hardware schematics can be found at: https://github.com/ZangJac/Saiyan.

Citations: 27

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Symposium on Networked Systems Design and Implementation Pub Date : 2022-08-12 DOI: 10.48550/arXiv.2208.06102

Jie You, Jaehoon Chung, Mosharaf Chowdhury

Citations: 21

Scalable Tail Latency Estimation for Data Center Networks

Symposium on Networked Systems Design and Implementation Pub Date : 2022-05-02 DOI: 10.48550/arXiv.2205.01234

Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas E. Anderson

Abstract: In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large scale data center networks. Network tail latency is often a crucial metric for cloud application performance that can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but are very hard to parallelize, taking hours or days to answer what-if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at a cost of including a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this gap by developing a set of techniques to provide fast performance estimates for large scale networks with general traffic matrices and topologies. A key step is to decompose the problem into a large number of parallel independent single-link simulations; we carefully combine these link-level simulations to produce accurate estimates of end-to-end flow level performance distributions for the entire network. Like MimicNet, we exploit symmetry where possible to gain additional speedups, but without relying on machine learning, so there is no training delay. On large-scale networks where ns-3 takes 11 to 27 hours to simulate five seconds of network behavior, our techniques run in one to two minutes with 99th percentile accuracy within 9% for flow completion times.

Citations: 6

Dashlet: Taming Swipe Uncertainty for Robust Short Video Streaming

Symposium on Networked Systems Design and Implementation Pub Date : 2022-04-27 DOI: 10.48550/arXiv.2204.12954

Zhuqi Li, Yaxiong Xie, R. Netravali, K. Jamieson

Citations: 3

Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs

Symposium on Networked Systems Design and Implementation Pub Date : 2022-04-26 DOI: 10.48550/arXiv.2204.12013

John Thorpe, Pengzhan Zhao, Jon Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, R. Netravali, Guoqing Harry Xu

Abstract: DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales. This paper aims to significantly reduce training costs with effective use of preemptible instances, i.e., those that can be obtained at a much cheaper price while idle, but may be preempted whenever requested by priority users. Doing so, however, requires new forms of resiliency and efficiency to cope with the possibility of frequent preemptions - a failure model that is drastically different from the occasional failures in normal cluster settings that existing checkpointing techniques target. We present Bamboo, a distributed system that tackles these challenges by introducing redundant computations into the training pipeline, i.e., whereby one node performs computations over not only its own layers but also over some layers in its neighbor. Our key insight is that training large models often requires pipeline parallelism where "pipeline bubbles" naturally exist. Bamboo carefully fills redundant computations into these bubbles, providing resilience at a low cost. Across a variety of widely used DNN models, Bamboo outperforms traditional checkpointing by 3.7x in training throughput, and reduces costs by 2.4x compared to a setting where on-demand instances are used.

Citations: 10

Canvas: Isolated and Adaptive Swapping for Multi-Applications on Remote Memory

Symposium on Networked Systems Design and Implementation Pub Date : 2022-03-17 DOI: 10.48550/arXiv.2203.09615

Chenxi Wang, Yifan Qiao, Haoran Ma, Shiafun Liu, Yiying Zhang, Wenguang Chen, R. Netravali, Miryung Kim, Guoqing Harry Xu

Abstract: Remote memory techniques for datacenter applications have recently gained a great deal of popularity. Existing remote memory techniques focus on the efficiency of a single application setting only. However, when multiple applications co-run on a remote-memory system, significant interference could occur, resulting in unexpected slowdowns even if the same amounts of physical resources are granted to each application. This slowdown stems from massive sharing in applications' swap data paths. Canvas is a redesigned swap system that fully isolates swap paths for remote-memory applications. Canvas allows each application to possess its dedicated swap partition, swap cache, prefetcher, and RDMA bandwidth. Swap isolation lays a foundation for adaptive optimization techniques based on each application's own access patterns and needs. We develop three such techniques: (1) adaptive swap entry allocation, (2) semantics-aware prefetching, and (3) two-dimensional RDMA scheduling. A thorough evaluation with a set of widely-deployed applications demonstrates that Canvas minimizes performance variation and dramatically reduces performance degradation.

Citations: 11

Stardust: Divide and Conquer in the Data Center Network

Symposium on Networked Systems Design and Implementation Pub Date : 2019-02-26 DOI: 10.17863/CAM.36895

Noa Zilberman, Gabriel Bracha, Golan Schzukin

Citations: 22

TimeCrypt: Encrypted Data Stream Processing at Scale with Cryptographic Access Control

Symposium on Networked Systems Design and Implementation Pub Date : 2018-11-08 DOI: 10.3929/ETHZ-B-000402391

Lukas Burkhalter, Anwar Hithnawi, Alexander Viand, Hossein Shafagh, S. Ratnasamy

Citations: 25