Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems

Trinayan Baruah, Yifan Sun, Ali Tolga Dinçer, Saiful A. Mojumder, José L. Abellán, Yash Ukidave, A. Joshi, Norman Rubin, John Kim, D. Kaeli
{"title":"Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems","authors":"Trinayan Baruah, Yifan Sun, Ali Tolga Dinçer, Saiful A. Mojumder, José L. Abellán, Yash Ukidave, A. Joshi, Norman Rubin, John Kim, D. Kaeli","doi":"10.1109/HPCA47549.2020.00055","DOIUrl":null,"url":null,"abstract":"As transistor scaling becomes increasingly more difficult to achieve, scaling the core count on a single GPU chip has also become extremely challenging. As the volume of data to process in today's increasingly parallel workloads continues to grow unbounded, we need to find scalable solutions that can keep up with this increasing demand. To meet the need of modern-day parallel applications, multi-GPU systems offer a promising path to deliver high performance and large memory capacity. However, multi-GPU systems suffer from performance issues associated with GPU-to-GPU communication and data sharing, which severely impact the benefits of multi-GPU systems. Programming multi-GPU systems has been made considerably simpler with the advent of Unified Memory which enables runtime migration of pages to the GPU on demand. Current multi-GPU systems rely on a first-touch Demand Paging scheme, where memory pages are migrated from the CPU to the GPU on the first GPU access to a page. The data sharing nature of GPU applications makes deploying an efficient programmer-transparent mechanism for inter-GPU page migration challenging. Therefore following the initial CPU-to-GPU page migration, the page is pinned on that GPU. Future accesses to this page from other GPUs happen at a cache-line granularity – pages are not transferred between GPUs without significant programmer intervention. We observe that this mechanism suffers from two major drawbacks: 1) imbalance in the page distribution across multiple GPUs, and 2) inability to move the page to the GPU that uses it most frequently. Both of these problems lead to load imbalance across GPUs, degrading the performance of the multi-GPU system. To address these problems, we propose Griffin, a holistic hardware-software solution to improve the performance of NUMA multi-GPU systems. Griffin introduces programmer-transparent modifications to both the IOMMU and GPU architecture, supporting efficient runtime page migration based on locality information. In particular, Griffin employs a novel mechanism to detect and move pages at runtime between GPUs, increasing the frequency of resolving accesses locally, which in turn improves the performance. To ensure better load balancing across GPUs, Griffin employs a Delayed First-Touch Migration policy that ensures pages are evenly distributed across multiple GPUs. 
Our results on a diverse set of multi-GPU workloads show that Griffin can achieve up to a 2.9× speedup on a multi-GPU system, while incurring low implementation overhead.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"27","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA47549.2020.00055","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 27

Abstract

As transistor scaling becomes increasingly difficult, scaling the core count on a single GPU chip has also become extremely challenging. As the volume of data to process in today's increasingly parallel workloads continues to grow unbounded, we need scalable solutions that can keep up with this increasing demand. To meet the needs of modern parallel applications, multi-GPU systems offer a promising path to deliver high performance and large memory capacity. However, multi-GPU systems suffer from performance issues associated with GPU-to-GPU communication and data sharing, which severely undermine their benefits. Programming multi-GPU systems has been made considerably simpler with the advent of Unified Memory, which enables runtime migration of pages to a GPU on demand. Current multi-GPU systems rely on a first-touch Demand Paging scheme, in which memory pages are migrated from the CPU to the GPU on the first GPU access to a page. The data-sharing nature of GPU applications makes deploying an efficient, programmer-transparent mechanism for inter-GPU page migration challenging. Therefore, following the initial CPU-to-GPU page migration, the page is pinned on that GPU. Future accesses to this page from other GPUs happen at a cache-line granularity: pages are not transferred between GPUs without significant programmer intervention. We observe that this mechanism suffers from two major drawbacks: 1) imbalance in the page distribution across multiple GPUs, and 2) inability to move a page to the GPU that uses it most frequently. Both of these problems lead to load imbalance across GPUs, degrading the performance of the multi-GPU system. To address these problems, we propose Griffin, a holistic hardware-software solution that improves the performance of NUMA multi-GPU systems. Griffin introduces programmer-transparent modifications to both the IOMMU and the GPU architecture, supporting efficient runtime page migration based on locality information. In particular, Griffin employs a novel mechanism to detect and move pages at runtime between GPUs, increasing the frequency of accesses resolved locally, which in turn improves performance. To ensure better load balancing across GPUs, Griffin employs a Delayed First-Touch Migration policy that ensures pages are evenly distributed across multiple GPUs. Our results on a diverse set of multi-GPU workloads show that Griffin can achieve up to a 2.9× speedup on a multi-GPU system, while incurring low implementation overhead.
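The first-touch behavior the abstract critiques can be seen with stock CUDA Unified Memory. The sketch below is illustrative only: it uses standard CUDA runtime calls (cudaMallocManaged, cudaSetDevice, cudaMemPrefetchAsync), and the comments describe the pinning/remote-access behavior of the multi-GPU baseline as the paper characterizes it, not Griffin itself.

```cuda
// Baseline first-touch demand paging with CUDA Unified Memory.
// In the baseline the paper describes, pages of `data` migrate from the
// CPU to whichever GPU touches them first, are then pinned there, and
// peer GPUs service later accesses remotely at cache-line granularity.
#include <cuda_runtime.h>

__global__ void touch(float *data, size_t n, float v) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += v;
}

int main() {
    const size_t n = 1 << 24;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // CPU-resident until first touch

    // GPU 0 touches the first half: those pages migrate to GPU 0.
    cudaSetDevice(0);
    touch<<<(n / 2 + 255) / 256, 256>>>(data, n / 2, 1.0f);

    // GPU 1 touches the second half: those pages migrate to GPU 1.
    cudaSetDevice(1);
    touch<<<(n / 2 + 255) / 256, 256>>>(data + n / 2, n / 2, 2.0f);
    cudaDeviceSynchronize();

    // If GPU 1 now reads GPU 0's half, the pages stay where first-touch
    // placed them; rebalancing requires an explicit programmer hint:
    cudaMemPrefetchAsync(data, (n / 2) * sizeof(float), 1 /* dst device */);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```

Griffin's goal, per the abstract, is to make the rebalancing step above unnecessary by detecting locality and migrating pages at runtime, without programmer intervention.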
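The abstract leaves the Delayed First-Touch Migration policy at a high level. Below is a minimal host-side sketch of one way such a policy could work, assuming a per-page fault counter and a fixed delay threshold; every name here (DelayedFirstTouch, kDelayThreshold, onGpuFault) is a hypothetical illustration, not Griffin's actual IOMMU interface.

```cuda
// Hypothetical sketch: delay placement of a page until several GPU
// faults have been observed, then place it on the GPU that touched it
// most, breaking ties toward the least-loaded GPU so pages stay evenly
// distributed across the system (the abstract's stated goal).
#include <array>
#include <cstdint>
#include <unordered_map>

constexpr int kNumGpus = 4;
constexpr uint32_t kDelayThreshold = 16;  // faults observed before placing

struct PageStats {
    std::array<uint32_t, kNumGpus> faults{};  // per-GPU fault counts
    int owner = -1;                           // -1: still CPU-resident
};

class DelayedFirstTouch {
    std::unordered_map<uint64_t, PageStats> pages_;
    std::array<uint64_t, kNumGpus> load_{};   // pages owned per GPU

public:
    // Called by the (hypothetical) IOMMU fault handler on each GPU fault.
    // Returns the GPU the page should live on, or -1 to keep it on the CPU.
    int onGpuFault(uint64_t pageAddr, int gpu) {
        PageStats &ps = pages_[pageAddr];
        if (ps.owner >= 0) return ps.owner;   // already placed
        ps.faults[gpu]++;

        uint32_t total = 0;
        for (uint32_t f : ps.faults) total += f;
        if (total < kDelayThreshold) return -1;  // keep serving from CPU

        // Place on the most frequent accessor; tie-break toward the
        // least-loaded GPU to keep the page distribution balanced.
        int best = 0;
        for (int g = 1; g < kNumGpus; ++g) {
            if (ps.faults[g] > ps.faults[best] ||
                (ps.faults[g] == ps.faults[best] && load_[g] < load_[best]))
                best = g;
        }
        ps.owner = best;
        load_[best]++;
        return best;
    }
};

int main() {
    DelayedFirstTouch policy;
    // Simulate GPU 2 faulting repeatedly on one page: placement is
    // deferred until kDelayThreshold faults, then resolves to GPU 2.
    int placed = -1;
    for (uint32_t i = 0; i < kDelayThreshold; ++i)
        placed = policy.onGpuFault(0x1000, 2);
    return placed == 2 ? 0 : 1;
}
```

The delay window is what separates this from plain first-touch: by waiting to see which GPUs actually fault on a page, the policy avoids committing a shared page to whichever GPU happened to run first, which is how it can address the distribution-imbalance drawback the abstract identifies.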