2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

Arbitration Policies for On-Demand User-Level I/O Forwarding on HPC Platforms
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00066
J. L. Bez, Alberto Miranda, Ramon Nou, F. Boito, Toni Cortes, P. Navaux
Abstract: I/O forwarding is a well-established and widely adopted technique in HPC to reduce contention in the access to storage servers and transparently improve I/O performance. Rather than having applications directly access the shared parallel file system, the forwarding technique defines a set of I/O nodes responsible for receiving application requests and forwarding them to the file system, thus reshaping the flow of requests. The typical approach is to statically assign I/O nodes to applications depending on the number of compute nodes they use, which is not necessarily related to their I/O requirements, so this approach leads to inefficient usage of these resources. This paper investigates arbitration policies based on the applications' I/O demands, represented by their access patterns. We propose a policy based on the Multiple-Choice Knapsack problem that seeks to maximize global bandwidth by giving more I/O nodes to applications that will benefit the most. Furthermore, we propose a user-level I/O forwarding solution as an on-demand service capable of applying different allocation policies at runtime for machines where this layer is not present. We demonstrate our approach's applicability through extensive experimentation and show it can transparently improve global I/O bandwidth by up to 85% in a live setup compared to the default static policy.
Citations: 3
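The arbitration policy described above maps naturally onto the Multiple-Choice Knapsack problem: each application contributes one class of mutually exclusive options (how many I/O nodes it could receive, each with a predicted bandwidth benefit), exactly one option is chosen per application, and the total may not exceed the I/O-node pool. The Python sketch below shows that formulation with a small dynamic program; the application names, option lists, and bandwidth estimates are illustrative assumptions, not the authors' actual model or data.

```python
# Multiple-Choice Knapsack sketch for I/O-node arbitration (illustrative only).
# Each application has mutually exclusive options (io_nodes, predicted_bandwidth);
# exactly one option is chosen per application, total io_nodes <= capacity,
# and the objective is to maximize aggregate predicted bandwidth.

def allocate_io_nodes(apps, capacity):
    """apps: dict name -> list of (io_nodes, bandwidth) options."""
    # best[c] = (total_bandwidth, choices) over apps processed so far, using c I/O nodes
    best = {0: (0.0, {})}
    for name, options in apps.items():
        nxt = {}
        for used, (bw, picks) in best.items():
            for nodes, gain in options:
                c = used + nodes
                if c > capacity:
                    continue
                cand = (bw + gain, {**picks, name: nodes})
                if c not in nxt or cand[0] > nxt[c][0]:
                    nxt[c] = cand
        best = nxt  # every application must take exactly one option
        if not best:
            return None  # infeasible: not enough I/O nodes for all applications
    return max(best.values(), key=lambda v: v[0])

# Hypothetical access-pattern-derived benefit estimates (GB/s) per option.
apps = {
    "app_A": [(1, 2.0), (2, 3.8), (4, 5.0)],
    "app_B": [(1, 1.0), (2, 1.2)],
    "app_C": [(1, 3.0), (2, 5.5), (4, 9.0)],
}
print(allocate_io_nodes(apps, capacity=6))
```

With these made-up numbers the policy would give the scarce nodes to app_C, which benefits the most from extra forwarders, while app_B keeps a single node.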
EAGLE: Expedited Device Placement with Automatic Grouping for Large Models
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00068
Hao Lan, Li Chen, Baochun Li
Abstract: Advanced deep neural networks with large sizes are usually trained on a mixture of devices, including multiple CPUs and GPUs. Model training speed and efficiency are drastically impacted by the placement of operations on devices. To identify the optimal device placement, the state-of-the-art method is based on reinforcement learning with a hierarchical model, which partitions the operations into groups and then assigns each group to specific devices. However, due to the additional dimension of grouping decisions coupled with the placement, reinforcement learning efficiency is greatly reduced. With modern neural networks growing in size and complexity, the issue of low efficiency and high cost in device placement is further aggravated. In this paper, we propose EAGLE (Expedited Automatic Grouping for Large modEls), which integrates automatic grouping into reinforcement learning-based placement in an optimal way, to achieve the best possible training time performance for very large models. An extra RNN is introduced to transform parameters of the grouper into inputs of the placer, linking the originally separated parts together. Further optimizations have also been made to the network inputs. We have deployed and extensively evaluated EAGLE on the Inception-V3, GNMT and BERT benchmarks. Compared with the state of the art, the performance achieved by our design, measured by the per-step time with the resulting placement, is 2.7% and 18.7% better for GNMT and BERT, respectively. For Inception-V3, our design achieves the fastest speed in discovering the optimal placement.
Citations: 5
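The key point of the abstract is that grouping and placement form one coupled search space. The toy sketch below illustrates only that joint space: operations are partitioned into groups, groups are mapped to devices, and a crude cost model (maximum per-device compute load plus cross-device communication) scores a joint assignment. Simple hill climbing stands in for the paper's reinforcement-learning training loop, and the operation costs, edges, and device counts are made up for illustration.

```python
import random

# Toy joint grouping + placement search (a stand-in for EAGLE's RL-based approach).
# ops: compute cost per operation; edges: (src, dst, bytes) communication volume.
ops = {f"op{i}": random.uniform(1, 5) for i in range(12)}
edges = [(f"op{i}", f"op{i+1}", random.uniform(0.1, 1.0)) for i in range(11)]
NUM_GROUPS, NUM_DEVICES = 4, 2

def cost(group_of, device_of_group):
    dev = {op: device_of_group[group_of[op]] for op in ops}
    load = [0.0] * NUM_DEVICES
    for op, c in ops.items():
        load[dev[op]] += c
    comm = sum(b for s, d, b in edges if dev[s] != dev[d])
    return max(load) + comm          # crude proxy for per-step time

def random_solution():
    return ({op: random.randrange(NUM_GROUPS) for op in ops},
            [random.randrange(NUM_DEVICES) for _ in range(NUM_GROUPS)])

group_of, device_of_group = random_solution()
best = cost(group_of, device_of_group)
for _ in range(2000):                # hill climbing over the joint search space
    g, d = dict(group_of), list(device_of_group)
    if random.random() < 0.5:
        g[random.choice(list(ops))] = random.randrange(NUM_GROUPS)   # regroup one op
    else:
        d[random.randrange(NUM_GROUPS)] = random.randrange(NUM_DEVICES)  # re-place a group
    c = cost(g, d)
    if c < best:
        group_of, device_of_group, best = g, d, c
print("estimated step cost:", round(best, 2))
```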
AuTraScale: An Automated and Transfer Learning Solution for Streaming System Auto-Scaling
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00100
Liang Zhang, Wenli Zheng, Chao Li, Yao Shen, M. Guo
Abstract: The complexity and variability of streaming data pose a great challenge to the elasticity of data processing systems. Streaming systems such as Flink and Storm need to adapt to workload changes with auto-scaling to meet QoS requirements while saving resources. However, the accuracy of classical models (such as a queueing model) for QoS prediction decreases as the complexity and variability of streaming data and the resource interference increase. On the other hand, the indirect metrics used to optimize QoS may not accurately guide resource adjustment. These problems can easily lead to wasted resources or QoS violations in practice. To solve them, we propose AuTraScale, an automated and transfer-learning auto-scaling solution that determines the appropriate parallelism and resource allocation to meet latency and throughput targets. AuTraScale uses Bayesian optimization to adapt to the complex relationship between resources and QoS, minimizing the impact of resource interference on prediction accuracy, and a new metric that measures the performance of operators for accurate optimization. Even when the input data rate changes, it can quickly adjust the parallelism of each operator in response, using a transfer learning algorithm. We have implemented and evaluated AuTraScale on a Flink platform. The experimental results show that, compared with state-of-the-art methods like DRS and DS2, AuTraScale reduces resource consumption by 66.6% and 36.7% in the scale-down and scale-up scenarios respectively while ensuring QoS requirements, and saves 13.5% resources on average when the input data rate changes.
Citations: 3
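Bayesian optimization here means repeatedly proposing a parallelism setting, measuring the resulting QoS, and updating a surrogate model of the resource-to-QoS relationship. The snippet below is a minimal, self-contained Gaussian-process loop over a one-dimensional parallelism knob with a made-up latency function; it only illustrates that propose-measure-update structure, not the paper's actual surrogate, operator metric, or transfer-learning step.

```python
import numpy as np

# Minimal Bayesian optimization over operator parallelism (illustrative only).
def measure_latency(p):                 # hypothetical noisy latency (ms) vs parallelism
    return 120.0 / p + 3.0 * p + np.random.normal(0, 1.0)

def gp_posterior(X, y, Xs, ls=4.0, noise=1.0):
    """Gaussian-process posterior mean/variance with an RBF kernel."""
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks, Kss = k(X, Xs), k(Xs, Xs)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = np.diag(Kss - Ks.T @ Kinv @ Ks)
    return mu, np.maximum(var, 1e-9)

candidates = np.arange(1, 33, dtype=float)   # candidate parallelism 1..32
X = np.array([1.0, 32.0])                    # two initial observations
y = np.array([measure_latency(int(p)) for p in X])

for _ in range(8):
    mu, var = gp_posterior(X, y, candidates)
    lcb = mu - 2.0 * np.sqrt(var)            # lower-confidence-bound acquisition (minimize)
    p = candidates[np.argmin(lcb)]
    X = np.append(X, p)
    y = np.append(y, measure_latency(int(p)))

best = int(X[np.argmin(y)])
print("chosen parallelism:", best, "latency:", round(float(y.min()), 1))
```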
QPR: Quantizing PageRank with Coherent Shared Memory Accelerators
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00105
A. Mughrabi, Mohannad Ibrahim, G. Byrd
Abstract: Graph algorithms often require fine-grained, random access across substantially large data structures. Previous work on FPGA-based acceleration has required significant preprocessing and restructuring to transform the memory access patterns into a streaming format that is more friendly to off-chip hardware. However, the emergence of cache-coherent shared memory interfaces, such as CAPI, allows designers to work more easily with the natural in-memory organization of the data. This paper introduces a vertex-centric shared-memory accelerator for the PageRank algorithm, optimized for high performance while effectively using coherent caching on the FPGA hardware. The proposed design achieves up to 14.9x speedups by selectively caching graph data for the accelerator while taking locality and reuse into account, compared to naively using shared address space access and DRAM only. We also introduce PageRank Quantization, an innovative technique to represent page ranks with 32-bit quantized fixed-point values. This approach is up to 1.5x faster than 64-bit fixed-point while keeping precision within a tolerable error margin. As a result, we maintain both the hardware scalability of fixed-point representation and the cache performance of 32-bit floating-point.
Citations: 2
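PageRank Quantization stores ranks as 32-bit fixed-point values instead of 64-bit ones, trading a small, bounded precision loss for cheaper storage and better cache behavior. The Python sketch below mimics the idea on the CPU with scaled integers: ranks are kept at a fixed fractional scale, the power-iteration update uses only integer arithmetic, and the result is compared against a double-precision baseline. The Q8.24 scale, graph, and iteration count are illustrative assumptions, not the accelerator's actual parameters.

```python
import numpy as np

# Toy PageRank with ranks stored as 32-bit fixed-point values (illustrative).
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]   # hypothetical tiny graph
N, d, iters = 4, 0.85, 30
out_deg = np.zeros(N, dtype=np.int64)
for s, _ in edges:
    out_deg[s] += 1

FRAC_BITS = 24                                      # Q8.24 fixed-point format
SCALE = 1 << FRAC_BITS
D_FX = int(round(d * SCALE))                        # damping factor, quantized
BASE_FX = int(round((1 - d) / N * SCALE))           # teleport term, quantized

def pagerank_float():
    r = np.full(N, 1.0 / N)
    for _ in range(iters):
        contrib = np.zeros(N)
        for s, t in edges:
            contrib[t] += r[s] / out_deg[s]
        r = (1 - d) / N + d * contrib
    return r

def pagerank_fixed():
    # Rank values fit in 32 bits; int64 is used only so intermediate products cannot overflow.
    r = np.full(N, SCALE // N, dtype=np.int64)
    for _ in range(iters):
        contrib = np.zeros(N, dtype=np.int64)
        for s, t in edges:
            contrib[t] += r[s] // out_deg[s]        # integer-only update
        r = BASE_FX + (D_FX * contrib) // SCALE
    return r.astype(np.float64) / SCALE

err = np.max(np.abs(pagerank_float() - pagerank_fixed()))
print("max abs error of 32-bit fixed-point ranks:", err)
```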
Towards Practical Cloud Offloading for Low-cost Ground Vehicle Workloads
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00083
Yuan Xu, Tianwei Zhang, Jimin Han, Sa Wang, Yungang Bao
Abstract: Low-cost Ground Vehicles (LGVs) have been widely adopted to conduct various tasks in our daily life. However, the limited on-board battery capacity and computation resources prevent LGVs from taking on more complex and intelligent workloads. A promising approach is to offload the computation from local LGVs to remote servers. However, current cloud-robotics research and platforms are still at a very early stage. Compared to other systems and devices, optimizing LGV workload offloading faces more challenges, such as the uncertainty of environments and the mobility of the devices. In this paper, we explore opportunities for optimizing cloud offloading of LGV workloads from the perspectives of performance, energy efficiency and network robustness. We first build an analytical model to reveal the computational role and impact of each function in LGV workloads. We then propose several optimization strategies (fine-grained migration, cloud acceleration, and real-time monitoring and adjustment) to accelerate workload computation, reduce on-board energy consumption, and increase network robustness. We implement an end-to-end cloud-robotic framework with these strategies to achieve dynamic and adaptive offloading. Evaluations on physical LGVs show that our strategies can significantly reduce the total energy consumption by 2.12× and the mission completion time by 2.53×, and maintain strong robustness under poor network quality.
Citations: 2
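The strategies above all hinge on deciding, per function, whether running it on the LGV or in the cloud gives better completion time and on-board energy. The sketch below captures only that basic trade-off with a made-up cost model: local execution costs compute time and compute energy, while offloading costs link-transfer time plus remote execution time and radio energy. The function list, bandwidth, power draw, and decision rule are all hypothetical, not the paper's analytical model.

```python
# Toy per-function offloading decision for an LGV (illustrative cost model only).
functions = {
    # name: (local_time_s, local_energy_J, input_MB, output_MB, cloud_time_s)
    "localization": (0.40, 3.0, 0.5, 0.1, 0.05),
    "path_planning": (0.25, 2.0, 0.2, 0.1, 0.04),
    "object_detection": (1.20, 9.0, 4.0, 0.1, 0.10),
}
BANDWIDTH_MBPS = 20.0          # assumed uplink/downlink throughput
RADIO_POWER_W = 1.5            # assumed transmit/receive power draw

def plan(functions):
    decisions = {}
    for name, (lt, le, in_mb, out_mb, ct) in functions.items():
        xfer = (in_mb + out_mb) * 8 / BANDWIDTH_MBPS      # seconds spent on the link
        remote_time = xfer + ct
        remote_energy = RADIO_POWER_W * xfer              # the robot only pays for the radio
        # offload when it is both faster and saves on-board energy
        offload = remote_time < lt and remote_energy < le
        decisions[name] = ("cloud" if offload else "local",
                           round(remote_time, 2), round(remote_energy, 2))
    return decisions

for f, d in plan(functions).items():
    print(f, "->", d)
```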
TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00016
Yuyao Niu, Zhengyang Lu, Meichen Dong, Zhou Jin, Weifeng Liu, Guangming Tan
Abstract: With the extensive use of GPUs in modern supercomputers, accelerating sparse matrix-vector multiplication (SpMV) on GPUs has received much attention over the last couple of decades. A number of techniques have been developed, such as increasing the utilization of wide vector units, reducing load imbalance, and selecting the best formats. However, the 2D spatial sparsity structure has not been well exploited in existing work on SpMV for GPUs. In this paper, we propose an efficient tiled algorithm called TileSpMV for optimizing SpMV on GPUs by exploiting the 2D spatial structure of sparse matrices. We first implement seven warp-level SpMV methods for calculating sparse tiles stored in a variety of formats, and then design a selection method to find the best format and SpMV implementation for each tile. We also adaptively extract the nonzeros in very sparse tiles into a separate matrix to maximize overall performance. The experimental results show that our method is faster than state-of-the-art SpMV methods such as Merge-SpMV, CSR5 and BSR for most matrices in the full SuiteSparse Matrix Collection, and delivers up to 2.61x, 3.96x and 426.59x speedups, respectively.
Citations: 24
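TileSpMV splits the sparse matrix into fixed-size 2D tiles, stores each tile in whichever format suits its nonzero structure, and runs a warp-level kernel per tile. The Python sketch below mirrors only that high-level idea on the CPU: tiles are classified by nonzero density, dense tiles are kept as small dense blocks, sparser tiles fall back to COO, and per-tile results are accumulated into y. The tile size, density threshold, and two-format choice are simplifications of the paper's seven warp-level formats and selection method.

```python
import numpy as np

TILE = 4                  # toy tile size (the paper uses GPU-warp-sized tiles)
DENSE_THRESHOLD = 0.5     # assumed density cutoff for the "dense tile" format

def build_tiles(A):
    """Partition A into TILE x TILE tiles, choosing a storage format per tile."""
    n = A.shape[0]
    tiles = []
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            block = A[i:i + TILE, j:j + TILE]
            nnz = np.count_nonzero(block)
            if nnz == 0:
                continue                                  # empty tiles are skipped entirely
            if nnz / block.size >= DENSE_THRESHOLD:
                tiles.append(("dense", i, j, block.copy()))
            else:                                         # COO for the sparser tiles
                r, c = np.nonzero(block)
                tiles.append(("coo", i, j, (r, c, block[r, c])))
    return tiles

def tiled_spmv(tiles, x, n):
    y = np.zeros(n)
    for fmt, i, j, data in tiles:
        if fmt == "dense":
            y[i:i + TILE] += data @ x[j:j + TILE]
        else:
            r, c, v = data
            np.add.at(y, i + r, v * x[j + c])
    return y

rng = np.random.default_rng(0)
A = rng.random((16, 16)) * (rng.random((16, 16)) < 0.2)   # ~20% dense toy matrix
x = rng.random(16)
tiles = build_tiles(A)
print("max error vs dense SpMV:", np.max(np.abs(tiled_spmv(tiles, x, 16) - A @ x)))
```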
RVMA: Remote Virtual Memory Access
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00029
Ryan E. Grant, M. Levenhagen, Matthew G. F. Dosanjh, Patrick M. Widener
Abstract: Remote Direct Memory Access (RDMA) capabilities have been provided by high-end networks for many years, but the network environments surrounding RDMA are evolving. RDMA performance has historically relied on strict ordering guarantees to determine when data transfers complete, but modern adaptively-routed networks no longer provide those guarantees. RDMA also exposes low-level details about memory buffers: either all clients are required to coordinate access using a single shared buffer, or exclusive resources must be allocatable per client for an unbounded amount of time. This makes RDMA unattractive for use in many-to-one communication models such as those found in public internet client-server situations. Remote Virtual Memory Access (RVMA) is a novel approach to data transfer which adapts and builds upon RDMA to provide better usability, resource management, and fault tolerance. RVMA provides a lightweight completion notification mechanism which addresses the RDMA performance penalties imposed by adaptively-routed networks, enabling high-performance data transfer regardless of message ordering. RVMA also provides receiver-side resource management, abstracting away previously exposed details from the sender side and removing the RDMA requirement for exclusive or coordinated resources. RVMA requires only small hardware modifications to current designs, provides performance comparable or superior to traditional RDMA networks, and offers many new features. In this paper, we describe RVMA's receiver-managed resource approach and how it enables a variety of new data-transfer approaches on high-end networks. In particular, we demonstrate how an RVMA NIC could implement the first hardware-based fault-tolerant RDMA-like solution. We present the design and validation of an RVMA simulation model in a popular simulation suite and use it to evaluate the advantages of RVMA at large scale. In addition to support for adaptive routing and easy programmability, RVMA can outperform RDMA on a 3D sweep application by 4.4x.
Citations: 0
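The central mechanism in the abstract is receiver-managed completion: the receiver exposes a virtual buffer and learns that a transfer has finished by counting what has arrived rather than relying on in-order delivery. The small simulation below models just that idea: packets of a message arrive in an arbitrary order, each lands at its offset in the receiver's buffer, and a counter fires the completion notification once the expected total has arrived. The class and field names are invented for the illustration and do not correspond to a real RVMA API.

```python
import random

class VirtualBuffer:
    """Toy receiver-managed buffer with counting-based completion (illustrative)."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.expected = size
        self.received = 0
        self.complete = False

    def deliver(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload
        self.received += len(payload)
        if self.received >= self.expected and not self.complete:
            self.complete = True          # completion notification fires here,
            print("buffer complete after", self.received, "bytes")  # independent of arrival order

message = bytes(range(256)) * 4           # a 1 KiB message
packets = [(off, message[off:off + 64]) for off in range(0, len(message), 64)]
random.shuffle(packets)                   # adaptive routing: packets arrive out of order

buf = VirtualBuffer(len(message))
for off, payload in packets:
    buf.deliver(off, payload)
assert bytes(buf.data) == message and buf.complete
```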
Accelerating non-power-of-2 size Fourier transforms with GPU Tensor Cores
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00059
Louis Pisha, Lukasz Ligowski
Abstract: Fourier transforms whose sizes are powers of two or have only small prime factors have been extensively studied, and optimized implementations are typically memory-bound. However, handling arbitrary transform sizes, which may be prime or have large prime factors, is difficult. Direct discrete Fourier transform (DFT) implementations involve extra computation, while fast Fourier transform (FFT)-style factorized decompositions introduce additional overheads in register use, multiprocessor occupancy, and memory traffic. Tensor Cores are hardware units in modern GPUs that perform matrix multiply-adds at a much higher throughput than normal GPU floating-point instructions. Because of their higher throughput and better uniformity across sizes, DFT/FFT implementations using Tensor Cores can surpass the performance of existing DFT/FFT implementations for difficult sizes. We present key insights of this approach, including complex number representation, efficient mapping of odd sizes to Tensor Cores (whose dimensions are all powers of 2), and adding a size-2 or size-4 epilogue transform at very low cost. Furthermore, we describe a method for emulating FP32 precision while using lower-precision Tensor Cores to accelerate the computation. For large batch sizes, our fastest Tensor Core implementation per size is at least 10% faster than the state-of-the-art cuFFT library in 49% of supported sizes for FP64 (double) precision and 42% of supported sizes for FP32 precision. The numerical accuracy of the results matches that of cuFFT for FP64 and is degraded by only about 0.3 bits on average for emulated FP32. To our knowledge, this is the first application of Tensor Cores to FFT computation that meets the accuracy and exceeds the speed of the state of the art.
Citations: 6
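Two ideas from this abstract are easy to illustrate on the CPU: (1) a DFT of arbitrary size N is just a matrix-vector product with the N x N Fourier matrix, which is why matrix-multiply hardware applies even when N is prime, and (2) higher precision can be emulated from lower-precision multiplies by splitting each operand into a coarse part and a residual and summing the cross products. The numpy sketch below demonstrates both with float16 pieces accumulated in float32, standing in for FP16 Tensor Core inputs with wider accumulation; it is not the paper's GPU implementation.

```python
import numpy as np

N = 13                                         # a prime, "difficult" transform size
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)   # the N x N DFT matrix
x = np.random.default_rng(1).standard_normal(N)

# (1) DFT as a plain matrix-vector product, checked against np.fft.fft.
print("matmul DFT error:", np.max(np.abs(F @ x - np.fft.fft(x))))

# (2) Emulating higher precision from low-precision pieces (real-valued demo).
A = np.real(F).astype(np.float32)
def split(M):                                  # M = hi + lo, both representable in fp16
    hi = M.astype(np.float16)
    lo = (M - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

A_hi, A_lo = split(A)
x_hi, x_lo = split(x.astype(np.float32))
f32 = lambda M: M.astype(np.float32)           # fp16 inputs, fp32 accumulation

naive = f32(A_hi) @ f32(x_hi)                               # one low-precision product
emulated = (f32(A_hi) @ f32(x_hi) + f32(A_hi) @ f32(x_lo)   # three products, lo*lo dropped
            + f32(A_lo) @ f32(x_hi))
exact = A.astype(np.float64) @ x
print("fp16-only error:    ", np.max(np.abs(naive - exact)))
print("emulated fp32 error:", np.max(np.abs(emulated - exact)))
```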
Performance Evaluation of Adaptive Routing on Dragonfly-based Production Systems
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00042
Sudheer Chunduri, K. Harms, Taylor L. Groves, P. Mendygral, Justs Zarins, M. Weiland, Yasaman Ghadar
Abstract: The performance of applications in production environments can be sensitive to network congestion. Cray Aries supports adaptively routing each network packet independently, based on the load or congestion encountered as the packet traverses the network. Software can dictate different routing policies, adjusting between minimal and non-minimal bias, for each posted message. We have extensively evaluated the sensitivity of application performance, as well as whole-system performance, to the routing bias selection in both production and controlled conditions. We show that the default routing bias used in Aries-based systems is often sub-optimal and that using a higher bias towards minimal routes not only reduces the congestion effects on the application but also decreases the overall congestion on the network. This routing scheme results not only in improved mean performance (by up to 12%) for most production applications but also in reduced run-to-run variability. Our study prompted two supercomputing facilities (ALCF and NERSC) to change the default routing mode on their Aries-based systems. We present the substantial improvement in overall congestion management and interconnect performance measured in production after making this change.
Citations: 2
Demystifying GPU UVM Cost with Deep Runtime and Workload Analysis
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | Pub Date: 2021-05-01 | DOI: 10.1109/IPDPS49936.2021.00023
Tyler N. Allen, Rong Ge
Abstract: With GPUs becoming ubiquitous in HPC systems, NVIDIA's Unified Virtual Memory (UVM) is being adopted to simplify porting complex codes to GPU platforms by allowing demand paging between host and device memory without programmer specification. Much like its storage-based counterparts, UVM provides a great deal of added usability at the cost of performance, due to its abstraction and fault-handling mechanisms. This prevents HPC systems from being used efficiently and effectively and decreases the overall value of GPU-based systems. To mitigate the cost of page-fault stall time, NVIDIA has introduced a prefetching mechanism in its UVM system. This prefetcher infers data ahead of time based on prior page-fault history, hoping to satisfy faults before they occur. Such a prefetcher must be cleverly designed and efficient, as it operates under the constraints of a real-time system in order to provide effective service. Additionally, the workload is quite complex due to the parallel nature of GPU faults, as well as page-fault serialization and fault-source erasure within the driver. The current prefetching mechanism uses a density-prefetching algorithm to offset the side effects of receiving page faults in parallel. While this prefetching can be very effective, it also has a negative impact on the performance of GPU oversubscription. In this paper, we provide a deep analysis of the overhead caused by UVM and its primary sources. Additionally, we analyze the impact of NVIDIA's prefetching and oversubscription in practice on different workloads, and correlate the performance to the driver implementation and prefetching mechanism. We provide design insights and improvement suggestions for hardware and middleware that open new avenues for performance gains.
Citations: 7
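The abstract refers to a density-based prefetcher that pulls in whole regions once enough of their pages have already faulted. The snippet below is a highly simplified model of that idea, not NVIDIA's driver implementation: faults are tracked per 64 KiB block inside a 2 MiB region, and when the fraction of resident blocks in the region crosses a threshold, the remaining blocks are marked as prefetched so later accesses to them do not fault. The block and region sizes and the 50% threshold are assumptions for illustration only.

```python
# Toy density-based prefetch model for managed (UVM-style) memory (illustrative).
BLOCK = 64 * 1024
REGION = 2 * 1024 * 1024
BLOCKS_PER_REGION = REGION // BLOCK
DENSITY_THRESHOLD = 0.5

resident = set()                 # blocks already on the GPU (faulted or prefetched)
faults = prefetches = 0

def access(addr):
    """Simulate one GPU access; fault if the enclosing block is not resident."""
    global faults, prefetches
    block = addr // BLOCK
    if block in resident:
        return
    faults += 1
    resident.add(block)
    # density check over the enclosing 2 MiB region
    region_start = (block // BLOCKS_PER_REGION) * BLOCKS_PER_REGION
    region_blocks = range(region_start, region_start + BLOCKS_PER_REGION)
    present = sum(b in resident for b in region_blocks)
    if present / BLOCKS_PER_REGION >= DENSITY_THRESHOLD:
        for b in region_blocks:                  # prefetch the rest of the region
            if b not in resident:
                resident.add(b)
                prefetches += 1

# Sequentially touch 4 MiB of managed memory at 4 KiB (page) granularity.
for addr in range(0, 2 * REGION, 4096):
    access(addr)
print("faults:", faults, "prefetched blocks:", prefetches)
```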