Supporting Address Translation for Accelerator-Centric Architectures

Yuchen Hao, Zhenman Fang, Glenn Reinman, Jason Cong
{"title":"Supporting Address Translation for Accelerator-Centric Architectures","authors":"Y. Hao, Zhenman Fang, Glenn D. Reinman, J. Cong","doi":"10.1109/HPCA.2017.19","DOIUrl":null,"url":null,"abstract":"While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, use cases and adoption can be limited by their rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can largely improve the programmability, which necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page walk latency, or rely on a slow IOMMU for correctness and safety—which penalizes the overall system performance. To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to drive the TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between the scratchpad memory of customized accelerators and the memory system, we present a relatively small private TLB design to provide low-latency caching of translations to each accelerator. Second, to compensate for the effects of the widely used data tiling techniques, we design a shared level-two TLB to serve private TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that are mapped to the same physical page. This two-level TLB design effectively reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU which introduces additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU satisfies the demand of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to the ideal address translation.","PeriodicalId":118950,"journal":{"name":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"65","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2017.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 65

Abstract

While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, use cases and adoption can be limited by their rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can largely improve the programmability, which necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page walk latency, or rely on a slow IOMMU for correctness and safety—which penalizes the overall system performance. To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to drive the TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between the scratchpad memory of customized accelerators and the memory system, we present a relatively small private TLB design to provide low-latency caching of translations to each accelerator. Second, to compensate for the effects of the widely used data tiling techniques, we design a shared level-two TLB to serve private TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that are mapped to the same physical page. This two-level TLB design effectively reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU which introduces additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU satisfies the demand of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to the ideal address translation.
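The abstract outlines a two-level translation path: each accelerator first checks its small private TLB, falls back to the shared level-two TLB on a miss (so accelerators working on neighboring tiles mapped to the same physical page reuse one another's translations), and only triggers a page walk, handled by the host per-core MMU, when both levels miss. The sketch below is a minimal functional model of that lookup flow, not the paper's hardware design; the capacities, the LRU replacement policy, and all names (Tlb, Accelerator, translate) are illustrative assumptions.

```python
# Illustrative sketch of the two-level TLB lookup flow described in the
# abstract; sizes and policies are hypothetical, not taken from the paper.
from collections import OrderedDict

PAGE_SHIFT = 12  # assume 4 KiB pages

class Tlb:
    """A tiny fully associative TLB with LRU replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # vpn -> pfn

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # refresh LRU position
            return self.entries[vpn]
        return None

    def insert(self, vpn, pfn):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[vpn] = pfn

class Accelerator:
    """An accelerator with a private TLB backed by a shared level-two TLB."""
    def __init__(self, shared_tlb, page_table, private_capacity=8):
        self.private_tlb = Tlb(private_capacity)
        self.shared_tlb = shared_tlb
        self.page_table = page_table
        self.page_walks = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        pfn = self.private_tlb.lookup(vpn)
        if pfn is None:
            pfn = self.shared_tlb.lookup(vpn)  # L2 serves private-TLB misses
            if pfn is None:
                # Both levels missed: walk the page table (in the paper,
                # this is delegated to the host per-core MMU).
                pfn = self.page_table[vpn]
                self.page_walks += 1
                self.shared_tlb.insert(vpn, pfn)
            self.private_tlb.insert(vpn, pfn)
        return (pfn << PAGE_SHIFT) | offset

# Two accelerators touching neighboring tiles on the same page: the second
# hits in the shared TLB instead of issuing a duplicate page walk.
page_table = {vpn: vpn + 100 for vpn in range(16)}  # toy identity-ish mapping
shared = Tlb(capacity=32)
a = Accelerator(shared, page_table)
b = Accelerator(shared, page_table)
a.translate(0x1000)                # miss at both levels -> one page walk
b.translate(0x1800)                # same page: private miss, shared hit
print(a.page_walks, b.page_walks)  # 1 0
```

Running the sketch shows the effect the paper targets: the second accelerator's access to a neighboring tile on the same page is served by the shared level-two TLB, so the duplicate page walk is eliminated.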