过滤翻译带宽与虚拟缓存

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2018-03-19 DOI:10.1145/3173162.3173195

Hongil Yoon, Jason Lowe-Power, G. Sohi

{"title":"过滤翻译带宽与虚拟缓存","authors":"Hongil Yoon, Jason Lowe-Power, G. Sohi","doi":"10.1145/3173162.3173195","DOIUrl":null,"url":null,"abstract":"Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical, GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., virtual cache hits), significantly reducing bandwidth demands for the shared address translation hardware. In addition, accelerator-specific attributes (e.g., less likelihood of synonyms) of GPUs reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31× speedup on average).","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":"{\"title\":\"Filtering Translation Bandwidth with Virtual Caching\",\"authors\":\"Hongil Yoon, Jason Lowe-Power, G. Sohi\",\"doi\":\"10.1145/3173162.3173195\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical, GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., virtual cache hits), significantly reducing bandwidth demands for the shared address translation hardware. In addition, accelerator-specific attributes (e.g., less likelihood of synonyms) of GPUs reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31× speedup on average).\",\"PeriodicalId\":302876,\"journal\":{\"name\":\"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3173162.3173195\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3173162.3173195","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 23

摘要

将GPU集成在与cpu相同的芯片上的异构计算是普遍存在的，并且为了增加可编程性，许多这些系统都支持来自GPU硬件的虚拟地址访问。然而，这需要在每次内存访问时进行地址转换。我们观察到，由于频繁的私有TLB丢失，未来的gpu和工作负载对共享地址转换硬件显示出非常高的带宽需求(在某些情况下每个周期多达4次访问)。这极大地影响了性能(相对于理想的MMU，平均性能下降32%)。为了减少这种开销，我们提出了一个软件无关的、实用的GPU虚拟缓存层次结构。我们使用虚拟缓存层次结构作为有效的地址转换带宽过滤器。我们观察到许多在私有tlb中丢失的请求在GPU缓存层次结构中找到相应的有效数据。使用GPU虚拟缓存层次结构，可以过滤这些TLB未命中(即虚拟缓存命中)，从而显着降低共享地址转换硬件的带宽需求。此外，gpu的特定于加速器的属性(例如，更少的同义词的可能性)降低了虚拟缓存的设计复杂性，使整个虚拟缓存层次结构(包括共享L2缓存)对gpu实用。我们的评估表明，整个GPU虚拟缓存层次结构有效地过滤了高地址转换带宽，实现了与理想MMU几乎相同的性能。我们还评估了仅l1的虚拟缓存设计，并表明使用整个虚拟缓存层次结构可以获得额外的性能优势(平均加速1.31倍)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Filtering Translation Bandwidth with Virtual Caching

Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical, GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., virtual cache hits), significantly reducing bandwidth demands for the shared address translation hardware. In addition, accelerator-specific attributes (e.g., less likelihood of synonyms) of GPUs reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31× speedup on average).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

自引率

0.00%

发文量