{"title":"BULB: Lightweight and Automated Load Balancing for Fast Datacenter Networks","authors":"Yuan Liu, Wenxin Li, W. Qu, Heng Qi","doi":"10.1145/3545008.3545021","DOIUrl":"https://doi.org/10.1145/3545008.3545021","url":null,"abstract":"Load balancing is essential for datacenter networks. However, prior solutions have significant limitations: they either are oblivious to congestion or involve a daunting and time-consuming parameter-tunning task over their heuristics for achieving good performance. Thus, we ask: is it possible to learn to balance datacenter traffic? While deep reinforcement learning (DRL) sounds like a good answer, we observe that it is too heavyweight due to the long decision-making latency. Therefore, we introduce BULB, a lightweight and automated datacenter load balancer. BULB learns link weights to guide the end-hosts to spread traffic, so as to free the central agent from quick flow-level decision-making. BULB offline trains a DRL agent for optimizing link weights but employs an imitation learning based approach to faithfully translate this agent’s DNN to a decision tree for online deployment. We implement a BULB prototype with a popular machine learning framework and evaluate it extensively in ns-3. The results show that BULB achieves up to 36.6%/56.4%, 19.9%/42.5%, 35.9%/54.8%, and 45.1%/67.7% better average/tail flow completion time than ECMP, CONGA, LetFlow, and Hermes, respectively. Moreover, BULB reduces the decision latency by 175 times while incurring only 2% performance loss after converting the DNN into a decision tree.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117296882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SHE: A Generic Framework for Data Stream Mining over Sliding Windows","authors":"Yuhan Wu, Zhuochen Fan, Qilong Shi, Yixin Zhang, Tong Yang, Cheng Chen, Zheng Zhong, Junnan Li, A. Shtul, Yaofeng Tu","doi":"10.1145/3545008.3545009","DOIUrl":"https://doi.org/10.1145/3545008.3545009","url":null,"abstract":"1Data stream mining over a sliding window is a fundamental problem in many applications, such as financial data trackers, intrusion detection and QoS. To meet the demand for high throughput of high speed data streams, sliding window algorithms turn to hardware platforms including FPGA/ASIC and programmable switches. These hardware platforms have three constraints for algorithms running on, which are 1) small memory usage 2) single stage memory access and 3) limited concurrent memory access. Algorithms perfectly fit in with these constraints will enable a highest utilization of these hardware platforms. However, no existing sliding window algorithm is specifically designed for hardware platforms. In this paper, we propose the Sliding Hardware Estimator (SHE), which is a generic framework that extends existing fixed window algorithms to sliding windows on hardware platforms. The key idea of SHE is that, during insertions we approximately delete out-dated information with little time and space overhead, while during queries we design sophisticated techniques to minimize error. We have fully implemented our SHE on FPGA, achieving a throughput of 544 Mips. We apply SHE to four typical data stream mining tasks. Experimental results show that, when compared with the state-of-the-art which cannot be implemented in hardware, SHE reduces the error by up to 100 times in membership queries. All related source codes are released at Github.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126309446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive and Efficient GPU Time Sharing for Hyperparameter Tuning in Cloud","authors":"L. Liu, Jian Yu, Zhijun Ding","doi":"10.1145/3545008.3545027","DOIUrl":"https://doi.org/10.1145/3545008.3545027","url":null,"abstract":"Hyperparameter tuning (HPT), which chooses a set of optimal hyperparameters for a learning algorithm, is critical to machine learning training. Unfortunately, the current resource provisioning approaches for HPT are unable to adjust resources adaptively according to the upward trends of HPT accuracy at runtime, resulting in low GPU utilization or HPT accuracy. On the other hand, dynamic resource provisioning approaches based on checkpointing are inefficient for HPT, because of high overhead of context switching and job restarting. This paper presents DISC, an adaptive and efficient HPT service with GPU time sharing for the cloud, which aims to improve GPU utilization and HPT accuracy. DISC provides a potential-aware GPU adaptive scaling to adjust the size of GPU time slices occupied by HPT jobs at runtime based on the upward trends of HPT accuracy. The dynamic allocation of GPU time slices is formalized as an optimization problem and tackled with an effective heuristic algorithm. Further, DISC achieves GPU memory temporal and spatial sharing according to the memory usage pattern of HPT jobs. It designs a time slice early release mechanism with relaxed PACK scheduling to improve memory utilization while avoiding memory overflow of the GPU due to time sharing. DISC is implemented upon the Kubeflow and Kubernetes ecosystem. We adopt a subset of Microsoft Philly Trace with public datasets to conduct evaluation. Experimental results show that DISC improves the average job completion time by 1.15x compared to the naïve approach and the HPT accuracy by 1.58x compared to a state-of-the-art early-stopping approach.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133100625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Counting Induced 6-Cycles in Bipartite Graphs","authors":"Jason Niu, J. Zola, Ahmet Erdem Sarıyüce","doi":"10.1145/3545008.3545076","DOIUrl":"https://doi.org/10.1145/3545008.3545076","url":null,"abstract":"Various complex networks in real-world applications are best represented as a bipartite graph, such as user-product, paper-author, and actor-movie relations. Motif-based analysis has substantial benefits for networks and bipartite graphs are no exception. The smallest non-trivial subgraph in a bipartite graph is a (2,2)-biclique, also known as a butterfly. Although butterflies are succinct, they are limited in capturing the higher-order relations between more than two nodes from the same node set. One promising structure in this context is the induced 6-cycle which consists of three nodes on each node set forming a cycle where each node has exactly two edges. In this paper, we study the problem of counting induced 6-cycles through parallel algorithms. To the best of our knowledge, this is the first study on induced 6-cycle counting. We first consider two adaptations based on previous works for cycle counting in bipartite networks. Then, we introduce a new approach based on the node triplets and offer a systematic way to count the induced 6-cycles. Our final algorithm, BatchTripletJoin, is parallelizable across root nodes and uses minimal global storage to save memory. Our experimental evaluation on a 52 core machine shows that BatchTripletJoin is significantly faster than the other algorithms while being scalable to large graph sizes and number of cores. On a network with 112M edges, BatchTripletJoin is able to finish the computation in 78 mins by using 52 threads.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129044795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Performance and Power-Efficiency Variations among NVIDIA GPUs","authors":"Kohei Yoshida, Rio Sageyama, Shinobu Miwa, Hayato Yamaki, H. Honda","doi":"10.1145/3545008.3545084","DOIUrl":"https://doi.org/10.1145/3545008.3545084","url":null,"abstract":"Understanding the variations in performance and power-efficiency of compute nodes is important for enhancing these factors in modern supercomputing systems. Previous studies have focused on variations in CPUs and DRAMs, but there has been little attention on GPUs. This is despite many current supercomputing systems employing GPUs (which consume a significant fraction of the power of such systems) as power-efficient accelerators for HPC applications. This paper describes the first thorough evaluation of performance and power-efficiency variations in GPUs. Specifically, we execute 48 CUDA kernels on 856 devices selected from three generations of NVIDIA GPUs (P100, V100, and A100), and analyze the impact of differences in both the CUDA kernels and GPU generation on performance and power-efficiency. Our analysis shows that there are non-negligible variations in both performance and power-efficiency, and that these variations are strongly affected by both the kernels that are running and the GPU generation.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131610467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting CXL-based Memory for Distributed Deep Learning","authors":"Moiz Arif, Kevin Assogba, M. M. Rafique, Sudharshan S. Vazhkudai","doi":"10.1145/3545008.3545054","DOIUrl":"https://doi.org/10.1145/3545008.3545054","url":null,"abstract":"Deep learning (DL) is being widely used to solve complex problems in scientific applications from diverse domains, such as weather forecasting, medical diagnostics, and fluid dynamics simulation. DL applications consume a large amount of data using large-scale high-performance computing (HPC) systems to train a given model. These workloads have large memory and storage requirements that typically go beyond the limited amount of main memory available on an HPC server. This significantly increases the overall training time as the input training data and model parameters are frequently swapped to slower storage tiers during the training process. In this paper, we use the latest advancements in the memory subsystem, specifically Compute Express Link (CXL), to provide additional memory and fast scratch space for DL workloads to reduce the overall training time while enabling DL jobs to efficiently train models using data that is much larger than the installed system memory. We propose a framework, called DeepMemoryDL, that manages the allocation of additional CXL-based memory, introduces a fast intermediate storage tier, and provides intelligent prefetching and caching mechanisms for DL workloads. We implement and integrate DeepMemoryDL with a popular DL platform, TensorFlow, to show that our approach reduces read and write latencies, improves the overall I/O throughput, and reduces the training time. Our evaluation shows a performance improvement of up to 34% and 27% compared to the default TensorFlow platform and CXL-based memory expansion approaches, respectively.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"333 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122978552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-Poll: Containing Pollution in Non-Inclusive Caches Through Cache Partitioning","authors":"Lucía Pons, J. Sahuquillo, S. Petit, Julio Pons","doi":"10.1145/3545008.3545083","DOIUrl":"https://doi.org/10.1145/3545008.3545083","url":null,"abstract":"Current server processors have redistributed the cache hierarchy space over previous generations. The private L2 cache has been made larger and the shared last level caches (LLC) smaller but designed as non-inclusive to reduce the number of replicated blocks. As a result, the new organization shrinks the per-core cache area. Cache management in this organization becomes more critical than in inclusive caches due to two main reasons: there is less storage capacity per core both in the L3 and when considering the sum of L2 and L3 cache sizes, and there is higher L2-L3 traffic especially when running high cache-demanding applications. This paper focuses on minimizing L3 cache pollution to make a more efficient use of the limited space. Three main types of pollution are identified and measured: useless prefetches, bad speculated loads, and poor locality. This paper proposes Cache-Poll, a pollution-aware management policy that concentrates on limiting the cache space to polluting and L3 insensitive applications, allowing critical applications occupy more space. Unlike state-of-the-art work on non-inclusive caches, Cache-Poll is able to improve performance in an Intel Xeon Scalable processor even when running heavy cache-demanding workloads, consisting of 12-application workloads, as many applications as cores in the processor. Results show that Cache-Poll improves fairness and turnaround time by 44% and 24%, respectively, over the Linux OS, while even improving performance up to 3.5%.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121632120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Eco-FL: Adaptive Federated Learning with Efficient Edge Collaborative Pipeline Training","authors":"Shengyuan Ye, Liekang Zeng, Qiong Wu, Ke Luo, Qingze Fang, Xu Chen","doi":"10.1145/3545008.3545015","DOIUrl":"https://doi.org/10.1145/3545008.3545015","url":null,"abstract":"Federated Learning (FL) has been a promising paradigm in distributed machine learning that enables in-situ model training and global model aggregation. While it can well preserve private data for end users, to apply it efficiently on IoT devices yet suffer from their inherent variants: their available computing resources are typically constrained, heterogeneous, and changing dynamically. Existing works deploy FL on IoT devices by pruning a sparse model or adopting a tiny counterpart, which alleviates the workload but may have negative impacts on model accuracy. To address these issues, we propose Eco-FL, a novel Edge Collaborative pipeline based Federated Learning framework. On the client side, each IoT device collaborates with trusted available devices in proximity to perform pipeline training, enabling local training acceleration with efficient augmented resource orchestration. On the server side, Eco-FL adopts a novel grouping-based hierarchical architecture that combines synchronous intra-group aggregation and asynchronous inter-group aggregation, where a heterogeneity-aware dynamic grouping strategy that jointly considers response latency and data distribution is developed. To tackle the resource fluctuation during the runtime, Eco-FL further applies an adaptive scheduling policy to judiciously adjust workload allocation and client grouping at different levels. Extensive experimental results using both prototype and simulation show that, compared to state-of-the-art methods, Eco-FL can upgrade the training accuracy by up to 26.3%, reduce the local training time by up to 61.5%, and improve the local training throughput by up to 2.6 ×.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115533887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dynamic and Recoverable BMT Scheme for Secure Non-Volatile Memory","authors":"Mengya Lei, Fang Wang, D. Feng, Xiao‐Qian Shuai, Yu Cao","doi":"10.1145/3545008.3545061","DOIUrl":"https://doi.org/10.1145/3545008.3545061","url":null,"abstract":"Data security is a key issue that non-volatile memory (NVM) system designers must consider. However, this is challenging because implementing security mechanisms such as bonsai merkle tree (BMT) in NVM needs to ensure crash recovery due to the non-volatile property of NVM. Existing schemes fail to efficiently guarantee the atomic BMT root update and instant system recovery required for BMT crash recovery, resulting in large write traffic and performance overhead. In this paper, we propose DR-TREE, a dynamic and recoverable BMT scheme for secure NVM, which reduces the update overhead of BMT root and achieves fast crash recovery with low write traffic. DR-TREE dynamically builds BMT and adjusts the updated BMT levels according to memory write requests, thus reducing unnecessary update overhead of BMT root. Next, based on the locality of memory write requests, DR-TREE merges repeated BMT updates, further decreasing the update overhead of BMT root. Moreover, DR-TREE achieves fast crash recovery with extremely low write traffic by delaying the partial recovery process. Experiments show that compared to the state-of-the-art design, DR-TREE improves the performance by 44.6%, decreases write traffic by 78.2% and achieves the system recovery in 5ms.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"223 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120932409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Latency-Sensitive DNN Inference via Joint Optimization of Model Surgery and Resource Allocation in Heterogeneous Edge","authors":"Zhaowu Huang, Fang Dong, Dian Shen, Huitian Wang, Xiaolin Guo, Shucun Fu","doi":"10.1145/3545008.3545071","DOIUrl":"https://doi.org/10.1145/3545008.3545071","url":null,"abstract":"Nowadays, edge computing is widely adopted to resolve the emerging deep neural networks (DNNs)-driven intelligence scenarios with the requirement of low-latency and high-accuracy, which includes heterogeneous end devices and DNNs. In such scenarios, the influx of data and computation into a shared edge server incurs prohibitive latency. Thus, we exploit the advantage of Multi-exit DNNs (ME-DNNs) that tasks can exit early at appropriate depths to save inference time. However, naively using ME-DNNs in the heterogeneous edge still fails to deliver fast inference due to improper model surgery and resource allocation. In this paper, we propose an Acceleration scheme for Inference based on ME-DNNs with Adaptive model surgery and resource allocation (AIMA) to accelerate DNN inferences. We model this problem as a mixed-integer programming problem that involves jointly optimizing model surgery and resource allocation to minimize the task completion time. We first determine the optimal resource allocation policy with a given model surgery decision profile, and then the model surgery decision-making is modeled as a weighted congestion game. We prove the existence of the Nash equilibrium and propose a decentralized algorithm. Extensive experimental results show that AIMA significantly outperforms the state-of-the-art methods, achieving up to 6.01 × speedup.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129448800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}