Parallel Network Slicing for Multi-SP Services
Rongxin Han, Deliang Chen, Song Guo, Xiaoyuan Fu, Jingyu Wang, Q. Qi, J. Liao
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545070

Abstract: Network slicing is rapidly prevailing in the edge cloud, which provides computing, network, and storage resources for various services. When multiple service providers (SPs) respond to their tenants in parallel, their individual decisions over the dynamic, shared edge cloud may lead to resource conflicts. The resource conflict problem can be formulated as a multi-objective constrained optimization model; however, it is challenging to solve due to the complexity of the resource interactions caused by co-existing multi-SP policies. We therefore propose CommDRL, a scheme based on multi-agent deep reinforcement learning (MADRL) and multi-agent communication, to tackle this challenge. CommDRL coordinates network resources between SPs with little overhead. Moreover, we design neuron hotplugging learning in CommDRL to deal with the dynamic edge cloud, which achieves scalability without the high cost of model retraining. Experiments demonstrate that CommDRL successfully obtains deployment policies and easily adapts to various network scales. It improves accepted requests by 7.4%, reduces resource conflicts by 14.5%, and shortens model convergence time by 83.3%.
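
The paper ships no code here; the following is a minimal, hypothetical sketch of the coordination step the abstract describes: each SP agent encodes its local observation into a message and conditions its slicing action on its own observation plus the peers' messages. The class names, dimensions, and linear policy are illustrative assumptions, not CommDRL's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class SPAgent:
    """Illustrative SP agent: a linear policy conditioned on its own
    observation plus messages received from the other agents."""

    def __init__(self, obs_dim, msg_dim, n_actions, n_agents):
        self.W_msg = rng.normal(size=(msg_dim, obs_dim)) * 0.1
        self.W_pi = rng.normal(size=(n_actions, obs_dim + msg_dim * (n_agents - 1))) * 0.1

    def message(self, obs):
        # Encode the local observation into a message for the other agents.
        return np.tanh(self.W_msg @ obs)

    def act(self, obs, peer_msgs):
        # Condition the slicing decision on local state and peers' messages,
        # so parallel decisions are less likely to collide on shared resources.
        x = np.concatenate([obs] + peer_msgs)
        logits = self.W_pi @ x
        return int(np.argmax(logits))

# One coordination round for three SPs sharing an edge cloud.
agents = [SPAgent(obs_dim=8, msg_dim=4, n_actions=5, n_agents=3) for _ in range(3)]
observations = [rng.normal(size=8) for _ in range(3)]
messages = [a.message(o) for a, o in zip(agents, observations)]
actions = [a.act(o, [m for j, m in enumerate(messages) if j != i])
           for i, (a, o) in enumerate(zip(agents, observations))]
print("slicing actions per SP:", actions)
```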

NNLQP: A Multi-Platform Neural Network Latency Query and Prediction System with An Evolving Database
Liang Liu, Mingzhu Shen, Ruihao Gong, F. Yu, Hailong Yang
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545051

Abstract: Deep neural networks (DNNs) are widely used in various applications, and accurate latency feedback is essential for model design and deployment. In this work, we alleviate the cost of acquiring model latency from two aspects: latency query and latency prediction. To ease the difficulty of acquiring model latency on multiple platforms, our latency query system automatically converts a DNN model into the corresponding executable format and measures its latency on the target hardware, so a latency query can be fulfilled with a simple interface call. To make efficient use of previous latency knowledge, we employ a MySQL database to store a large number of models and their corresponding latencies; with it, the efficiency of latency queries is boosted by 1.8x. For latency prediction, we first represent neural networks with a unified GNN-based graph embedding. With the help of the evolving database, our model-based latency predictor achieves better performance, realizing a 12.31% accuracy improvement over existing methods. Our code is open-sourced at https://github.com/ModelTC/NNLQP.
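
A minimal sketch of such a query path, assuming a cache-then-measure design: a dict stands in for the MySQL database, and the convert/measure callables stand in for the platform-specific conversion and on-device measurement. This is not the actual NNLQP interface (see the linked repository for that).

```python
import hashlib, json

latency_db = {}  # stand-in for the MySQL database of (model, platform) -> latency

def model_fingerprint(model_graph: dict) -> str:
    """Hash a (serializable) model graph so repeated queries hit the database."""
    return hashlib.sha256(json.dumps(model_graph, sort_keys=True).encode()).hexdigest()

def query_latency(model_graph: dict, platform: str, convert, measure) -> float:
    """Return latency in ms, reusing stored measurements when available."""
    key = (model_fingerprint(model_graph), platform)
    if key in latency_db:                           # previous latency knowledge
        return latency_db[key]
    executable = convert(model_graph, platform)     # e.g. ONNX -> vendor format
    latency_ms = measure(executable, platform)      # run on the target hardware
    latency_db[key] = latency_ms                    # evolve the database
    return latency_ms

# Toy usage with stub converter/measurer; the second call is served from the cache.
toy_model = {"nodes": ["conv", "relu", "fc"]}
lat = query_latency(toy_model, "gpu-a", lambda m, p: ("bin", m), lambda e, p: 3.2)
print("latency:", lat, "ms; cached:", query_latency(toy_model, "gpu-a", None, None))
```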

Characterizing and Optimizing Transformer Inference on ARM Many-core Processor
Jiazhi Jiang, Jiangsu Du, Dan-E Huang, Dongsheng Li, Jiang Zheng, Yutong Lu
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545022

Abstract: Transformer has experienced tremendous success and revolutionized the field of natural language processing (NLP). While the GPU has become the de facto standard for deep learning computation in many cases, there are still many scenarios where using the CPU remains a prevalent choice. In particular, ARM many-core processors are emerging as competitive candidates for HPC systems and are a promising target for deploying Transformer inference. In this paper, we first identify three performance bottlenecks of Transformer inference on many-core CPUs: isolated thread scheduling and configuration, inappropriate GEMM implementations, and redundant computations for variable-length inputs. To tackle these problems, we propose cross-layer optimizations spanning the operator layer to the runtime layer. To improve parallel efficiency, we design NUMA-aware thread scheduling and a look-up table for optimal parallel configurations. The GEMM implementation is tailored for several critical modules to suit the characteristics of the Transformer workload. To eliminate redundant computations, a novel storage format is designed and implemented to pack sparse data, together with a load-balancing distribution strategy for tasks with different sparsity. Our experimental results show that our implementation outperforms existing solutions by 1.1x to 6x for fixed-length inputs and 1.9x to 6x for variable-length inputs, depending on sequence length and batch size.
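
A minimal sketch of the parallel-configuration look-up idea under assumed table entries: offline profiling fills a table keyed by (batch, sequence length), and at run time the nearest profiled point decides the thread count and NUMA placement. The grid points and values are made up for illustration.

```python
# Assumed shapes/values only: a look-up table for parallel configurations.
from bisect import bisect_left

# (batch, seq_len) -> (num_threads, threads_per_numa_node); values are made up.
CONFIG_TABLE = {
    (1, 128): (8, 8),
    (8, 128): (32, 16),
    (8, 512): (64, 16),
    (32, 512): (128, 32),
}
BATCHES = sorted({b for b, _ in CONFIG_TABLE})
SEQS = sorted({s for _, s in CONFIG_TABLE})

def nearest(sorted_vals, x):
    """Snap x to the closest profiled grid point."""
    i = min(bisect_left(sorted_vals, x), len(sorted_vals) - 1)
    if i > 0 and abs(sorted_vals[i - 1] - x) <= abs(sorted_vals[i] - x):
        i -= 1
    return sorted_vals[i]

def parallel_config(batch, seq_len):
    key = (nearest(BATCHES, batch), nearest(SEQS, seq_len))
    return CONFIG_TABLE[key]

print(parallel_config(6, 200))   # -> (32, 16) under these assumed table entries
```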

Acuerdo: Fast Atomic Broadcast over RDMA
Joseph Izraelevitz, Gaukas Wang, Rhett Hanscom, Kayli Silvers, T. Lehman, G. Chockler, Alexey Gotsman
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545041

Abstract: Atomic broadcast protocols ensure that messages are delivered to a group of machines in some total order, even when some of those machines fail. These protocols are key to making distributed services fault-tolerant, as their total-order guarantee keeps multiple service replicas in sync. Unfortunately, atomic broadcast protocols are also notoriously expensive. We present a new protocol, called Acuerdo, that improves atomic broadcast performance by using remote direct memory access (RDMA). Acuerdo is built from the ground up to communicate using one-sided RDMA writes, which do not use the CPU of the remote machine, and is explicitly designed to minimize waiting on the critical path. Our experimental results demonstrate that Acuerdo provides raw throughput comparable to or exceeding other RDMA atomic broadcast protocols, while improving latency by almost 2x.
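
A toy, single-process analogue of the one-sided write pattern, not Acuerdo's protocol: the leader "writes" log entries directly into each follower's pre-registered buffer (a plain list here) and advances a published counter, while followers only poll. Real one-sided writes go through RDMA verbs (e.g., ibv_post_send with IBV_WR_RDMA_WRITE) against registered memory; all names below are illustrative.

```python
class FollowerBuffer:
    def __init__(self, capacity=1024):
        self.slots = [None] * capacity
        self.committed = -1          # highest slot the leader has published

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.next_slot = 0

    def broadcast(self, msg):
        slot = self.next_slot
        self.next_slot += 1
        for f in self.followers:      # one "write" per follower, no receiver CPU
            f.slots[slot] = msg
            f.committed = slot        # analogue of the remotely written flag
        return slot

def poll_deliver(buf, last_seen):
    """Follower-side polling loop body: deliver anything newly committed."""
    delivered = buf.slots[last_seen + 1: buf.committed + 1]
    return buf.committed, delivered

followers = [FollowerBuffer(), FollowerBuffer()]
leader = Leader(followers)
leader.broadcast(b"put x=1")
leader.broadcast(b"put y=2")
print(poll_deliver(followers[0], -1))   # (1, [b'put x=1', b'put y=2'])
```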

DSSA: Dual-Side Sparse Systolic Array Architecture for Accelerating Convolutional Neural Network Training
Zheng Chen, Qi Yu, Fang Zheng, F. Guo, Zuoning Chen
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545086

Abstract: Ever-growing CNN sizes incur a significant amount of redundancy in model parameters, which in turn puts a considerable burden on hardware. Unstructured pruning is widely used to remove this redundancy, but the irregular sparsity it introduces makes it difficult to accelerate sparse CNNs on a systolic array. To address this issue, a variety of accelerators have been proposed. SIGMA, the state-of-the-art sparse GEMM accelerator, achieves significant speedup over systolic arrays. However, SIGMA suffers from two disadvantages: (1) it only supports one-side sparsity, leaving potential for further performance gains; and (2) it improves the utilization of large systolic arrays at the cost of extra overhead. In this paper, we propose DSSA, a dual-side sparse systolic array, to accelerate CNN training. DSSA bases its design on a small-sized systolic array, which naturally achieves higher cell utilization without additional overhead. To facilitate dual-side sparsity processing, DSSA uses a cross-cycle reduction module to accumulate partial sums that belong to the same column but are processed in different cycles. A comprehensive design space exploration is performed to seek locally optimal configurations for DSSA. We implement the logic design of DSSA in Verilog at RTL and evaluate its performance using a C++-based cycle-accurate performance simulator we built. Experimental results show that DSSA delivers, on average, a speedup of 2.13x and 13.81x over SIGMA and a basic systolic array with the same number of cells, respectively. Compared to SIGMA, DSSA incurs 16.59% area overhead and 25.49% power overhead when the sparse filter is excluded, as in SIGMA.
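
A functional analogue (software only, not the hardware dataflow) of dual-side sparsity: multiplications are skipped whenever either operand is zero, and partial sums that target the same output position in different "cycles" are merged by a reduction step, loosely mirroring the cross-cycle reduction module.

```python
from collections import defaultdict

def dual_side_sparse_gemm(A, B):
    """A: M x K, B: K x N nested lists with many zeros."""
    M, K, N = len(A), len(B), len(B[0])
    partials = defaultdict(float)          # (row, col) -> accumulated partial sum
    skipped = done = 0
    for k in range(K):                     # each k plays the role of one cycle
        for i in range(M):
            a = A[i][k]
            if a == 0:
                skipped += N
                continue
            for j in range(N):
                b = B[k][j]
                if b == 0:
                    skipped += 1
                    continue
                partials[(i, j)] += a * b  # cross-cycle reduction analogue
                done += 1
    C = [[partials[(i, j)] for j in range(N)] for i in range(M)]
    return C, done, skipped

A = [[1, 0, 2], [0, 0, 3]]
B = [[0, 4], [5, 0], [6, 0]]
C, done, skipped = dual_side_sparse_gemm(A, B)
print(C, f"MACs executed: {done}, skipped: {skipped}")  # dense would need 12 MACs
```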

IATF: An Input-Aware Tuning Framework for Compact BLAS Based on ARMv8 CPUs
Cunyang Wei, Haipeng Jia, Yunquan Zhang, Liusha Xu, Ji Qi
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545032

Abstract: Mainstream basic linear algebra libraries deliver high performance on large-scale General Matrix Multiplication (GEMM) and Triangular System Solve (TRSM). However, these libraries are still insufficient to provide sustained performance for batched operations on large groups of fixed-size small matrices on specific architectures, which are extensively used in various scientific computing applications. In this paper, we propose IATF, an input-aware tuning framework for optimizing large groups of fixed-size small GEMM and TRSM to achieve near-optimal performance on the ARMv8 architecture. IATF contains two stages: an install-time stage and a run-time stage. In the install-time stage, based on a SIMD-friendly data layout, we propose computing-kernel templates for high-performance GEMM and TRSM, analyze optimal kernel sizes to increase the ratio of computational instructions, and design kernel optimization strategies to improve kernel execution efficiency. An optimized data-packing strategy is also presented to minimize memory-access overhead for the computing kernels. In the run-time stage, we present an input-aware tuning method that generates an efficient execution plan for a large group of fixed-size small GEMM or TRSM operations according to the input matrix properties. The experimental results show that IATF achieves significant performance improvements on GEMM and TRSM compared with other mainstream BLAS libraries.
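
An illustrative sketch of the run-time idea under stated assumptions: because every matrix in the batch has the same small fixed size, a few candidate kernels can be timed once on a sample and the whole batch then runs with the winner. The "kernels" here are just numpy variants; the real framework selects among ARMv8 assembly kernels with different register blockings and packing strategies.

```python
import time
import numpy as np

def kernel_plain(A, B):
    return A @ B

def kernel_transposed_pack(A, B):
    # Pretend "packing": compute via a transposed layout (mathematically identical).
    return (B.T @ A.T).T

CANDIDATES = {"plain": kernel_plain, "packed": kernel_transposed_pack}

def tune(sample_A, sample_B, reps=200):
    """Time each candidate on one representative pair and return the fastest."""
    best, best_t = None, float("inf")
    for name, kern in CANDIDATES.items():
        start = time.perf_counter()
        for _ in range(reps):
            kern(sample_A, sample_B)
        t = time.perf_counter() - start
        if t < best_t:
            best, best_t = name, t
    return best

def batched_gemm(As, Bs):
    kern = CANDIDATES[tune(As[0], Bs[0])]   # one tuning pass for the whole batch
    return [kern(A, B) for A, B in zip(As, Bs)]

rng = np.random.default_rng(0)
As = [rng.standard_normal((8, 8)) for _ in range(1000)]
Bs = [rng.standard_normal((8, 8)) for _ in range(1000)]
Cs = batched_gemm(As, Bs)
print(len(Cs), Cs[0].shape)
```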

ROWE-tree: A Read-Optimized and Write-Efficient B+-tree for Persistent Memory
Xiaomin Zou, Fang Wang, D. Feng, Tianjin Guan, Nan Su
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545043

Abstract: Persistent memory (PM) has huge potential to provide B+-tree indexes with high performance, efficient persistence, and instant recovery. A large number of PM-optimized B+-tree indexes have been proposed, but most fail to provide high performance for both reads and writes because (1) their search optimizations and insert improvements are often traded off against each other, and (2) they overlook the read/write interference problem of PM, which causes unpredictable performance degradation. In this paper, we propose ROWE-tree, a read-optimized and write-efficient B+-tree for PM. Its design has three key points. First, we propose two techniques to strike a good trade-off between write and read performance: self-verifying insertion, which reduces consistency overhead by using the key itself as a persist mark instead of additional metadata, and semi-sorted leaf nodes, which use append-only insertion to avoid the shifting overhead of keeping nodes sorted while keeping intra-cache-line items sorted to accelerate lookups. Second, based on the observation that data accesses are highly skewed in real-world workloads, we build an in-DRAM cache of hot items that outsources their accesses to DRAM, alleviating the read/write interference of PM and significantly improving overall performance. Third, to cope with dynamic changes in the set of hot items, we exploit a lightweight mechanism to track such changes at run time. On Intel Optane DCPMM, our evaluations show that ROWE-tree achieves up to 3.86x higher performance than state-of-the-art PM B+-tree indexes under YCSB workloads.
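
A minimal sketch of the semi-sorted leaf idea, with persistence and the self-verifying persist mark omitted: new keys are appended rather than shifted into place, but items stay sorted inside each fixed-size group (a stand-in for one cache line), so lookups can binary-search group by group. The group size and class layout are assumptions for illustration.

```python
from bisect import bisect_left, insort

GROUP = 4   # assumed number of (key, value) slots per "cache line"

class SemiSortedLeaf:
    def __init__(self):
        self.groups = [[]]          # list of sorted groups, filled in append order

    def insert(self, key, value):
        if len(self.groups[-1]) == GROUP:
            self.groups.append([])  # open a new group instead of shifting items
        insort(self.groups[-1], (key, value))   # sort only within the last group

    def lookup(self, key):
        for g in self.groups:       # groups are unsorted with respect to each other
            i = bisect_left(g, (key,))
            if i < len(g) and g[i][0] == key:
                return g[i][1]
        return None

leaf = SemiSortedLeaf()
for k in [42, 7, 19, 3, 88, 5]:
    leaf.insert(k, f"v{k}")
print(leaf.lookup(19), leaf.lookup(100))   # -> v19 None
```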

Performance Modeling for Short-Term Cache Allocation
Christopher Stewart, Nathaniel Morris, L. Chen, R. Birke
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545094

Abstract: Short-term cache allocation grants, and then revokes, access to processor cache lines dynamically. For online services, short-term allocation can speed up targeted query executions and free up cache lines that are reserved, but normally not needed, for performance. However, in collocated settings, short-term allocation can increase cache contention, slowing down collocated query executions. To offset these slowdowns, collocated services may request short-term allocation more often, making the problem worse. Short-term allocation policies manage which queries receive cache allocations and when. In collocated settings, these policies should balance targeted query speedups against slowdowns caused by recurring cache contention. We present a model-driven approach that (1) predicts response time under a given policy, (2) explores competing policies, and (3) chooses policies that yield low response time for all collocated services. Our approach profiles cache usage offline, characterizes the effects of cache allocation policies using deep learning techniques, and devises novel performance models for short-term allocation with online services. We tested our approach using data processing, cloud, and high-performance computing benchmarks collocated on Intel processors equipped with Cache Allocation Technology. Our models predicted median response time with 11% absolute percent error. Short-term allocation policies found with our approach outperformed state-of-the-art shared cache allocation policies by 1.2-2.3x.
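
A sketch of the policy-search loop under stated assumptions: a predictor (here a stub standing in for the learned performance model) estimates each service's response time under a candidate policy, and the search keeps the policy with the best worst-case response time across all collocated services. The policy fields and cost model are invented for illustration.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Policy:
    ways_granted: int      # cache ways given to a targeted query
    grant_ms: int          # how long the short-term grant lasts

def predict_response_ms(service: str, policy: Policy) -> float:
    """Stub for the learned model: bigger/longer grants help the targeted
    query but add contention for every collocated service."""
    speedup = 1.0 + 0.05 * policy.ways_granted
    contention = 1.0 + 0.002 * policy.grant_ms * policy.ways_granted
    base = {"db": 40.0, "web": 25.0, "hpc": 60.0}[service]
    return base * contention / speedup

def choose_policy(services, candidates):
    def worst_case(policy):
        return max(predict_response_ms(s, policy) for s in services)
    return min(candidates, key=worst_case)

candidates = [Policy(w, t) for w, t in product([2, 4, 8], [10, 50, 200])]
best = choose_policy(["db", "web", "hpc"], candidates)
print("chosen policy:", best)
```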

UA-Sketch: An Accurate Approach to Detect Heavy Flow based on Uninterrupted Arrival
Jingjing Ye, Lin Li, Wenlu Zhang, Guihao Chen, Yuanchao Shan, Yijun Li, Weihe Li, Jiawei Huang
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545017

Abstract: Heavy-flow detection in enormous volumes of network traffic is a critical task for network measurement. Due to limited memory and high link capacity, accurately detecting heavy flows is challenging in large-scale networks. Almost all existing approaches to detecting heavy flows use a single-dimensional statistic, flow size, to make flow-replacement decisions. However, under a massive number of small flows, heavy flows are prone to being frequently and mistakenly replaced, resulting in unsatisfactory accuracy. To solve this problem, we reveal that the number of uninterrupted arrival packets is a useful metric for identifying flow types. We further propose UA-Sketch, which expels small flows and protects heavy ones according to multi-dimensional statistics covering both the estimated flow size and the number of uninterrupted arrival packets. Trace-driven simulations and OVS experiments show that, even with small memory, UA-Sketch achieves higher accuracy than existing works, improving the F1 score by up to 2.1x.
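
A simplified illustration of the idea, not the exact UA-Sketch algorithm: each bucket keeps an estimated size plus a count of uninterrupted arrivals, and the incumbent flow is evicted only when both statistics suggest it is small. The bucket layout, thresholds, and replacement rule below are assumptions.

```python
import random

class Bucket:
    __slots__ = ("flow", "size", "ua")
    def __init__(self):
        self.flow, self.size, self.ua = None, 0, 0

class TwoDimSketch:
    def __init__(self, width=1024, ua_protect=3):
        self.buckets = [Bucket() for _ in range(width)]
        self.ua_protect = ua_protect     # streak length that marks a likely heavy flow

    def update(self, flow_id):
        b = self.buckets[hash(flow_id) % len(self.buckets)]
        if b.flow == flow_id:            # uninterrupted arrival for the incumbent
            b.size += 1
            b.ua += 1
        elif b.flow is None:
            b.flow, b.size, b.ua = flow_id, 1, 1
        else:                            # a different flow hit this bucket
            b.ua = max(b.ua - 1, 0)      # the arrival streak is interrupted
            b.size -= 1
            if b.size <= 0 and b.ua < self.ua_protect:
                b.flow, b.size, b.ua = flow_id, 1, 1   # evict the incumbent

random.seed(0)
sk = TwoDimSketch(width=8)
stream = ["H"] * 50 + ["s%d" % i for i in range(30)]   # one heavy flow, many small ones
random.shuffle(stream)
for pkt in stream:
    sk.update(pkt)
print([(b.flow, b.size) for b in sk.buckets if b.flow])
```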

Simmer: Rate proportional scheduling to reduce packet drops in vGPU based NF chains
A. Chaurasia, Anshuj Garg, B. Raman, Uday Kurkure, Hari Sivaraman, Lan Vu, S. Veeraswamy
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545068

Abstract: The Network Function Virtualization (NFV) paradigm offers flexibility, cost benefits, and ease of deployment by decoupling network functions from hardware middleboxes. Service function chains (SFCs) deployed on an NFV platform require efficient sharing of resources among the various network functions in the chain. Graphics Processing Units (GPUs) have been used to improve the performance of various network functions. However, sharing a single GPU among multiple virtualized network functions (virtual machines) in a service function chain has been challenging due to the GPUs' proprietary hardware and software stacks. Earlier GPU architectures had a limitation: a single physical GPU could only be allocated to one virtual machine (VM) and could not be shared among multiple VMs. Newer GPUs are virtualization-aware (hardware-assisted virtualization) and allow multiple virtual machines to share a single physical GPU. Although virtualization-aware, these GPUs still lack support for custom scheduling policies and do not expose preemption control to users. When network functions (hosted within virtual machines) with different processing requirements share the same GPU, the default round-robin scheduling of virtualization-aware GPUs proves inefficient, resulting in packet drops and lower throughput. This paper presents Simmer, an efficient mechanism for scheduling a network function service chain on virtualization-aware GPUs. Our scheduling solution considers the processing requirements of the NFs in a GPU-based SFC, improving overall throughput by up to 29% and reducing packet drops to zero compared to the vanilla setup.
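
A minimal sketch of the rate-proportional idea, with made-up numbers: instead of equal round-robin slices, each NF's share of the GPU within a scheduling period is made proportional to the work it must do (offered packet rate times per-packet GPU cost). The NF names, rates, and costs are illustrative assumptions, not Simmer's measured values.

```python
from dataclasses import dataclass

@dataclass
class NF:
    name: str
    pkts_per_ms: float     # offered load into this NF
    us_per_pkt: float      # GPU time needed per packet

def rate_proportional_slices(chain, period_ms=10.0):
    """Return each NF's GPU time (ms) within one scheduling period."""
    # ms of GPU work generated per ms of wall clock, for each NF
    demand = {nf.name: nf.pkts_per_ms * nf.us_per_pkt / 1000.0 for nf in chain}
    total = sum(demand.values())
    return {name: period_ms * d / total for name, d in demand.items()}

chain = [NF("firewall", 800, 2.0), NF("ids", 800, 6.0), NF("nat", 800, 1.0)]
slices = rate_proportional_slices(chain)
print({k: round(v, 2) for k, v in slices.items()})
# Round-robin would give each NF ~3.33 ms; here the IDS, which needs the most
# GPU time per packet, gets the largest slice, reducing queue build-up and drops.
```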