DTM-NUCA: Dynamic Texture Mapping-NUCA for Energy-Efficient Graphics Rendering

2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). Pub Date: 2022-03-01. DOI: 10.1109/pdp55904.2022.00030

David Corbalán-Navarro, Juan L. Aragón, Joan-Manuel Parcerisa, Antonio González


Citations: 1

Abstract

Modern mobile GPUs integrate an increasing number of shader cores to speed up the execution of graphics workloads. Each core integrates a private Texture Cache, used to apply texturing effects to objects and backed by a shared L2 cache. However, as in any other memory hierarchy, this organization replicates data in the upper levels (i.e., the private Texture Caches): replication allows faster accesses, but at the expense of reducing the overall effective capacity of those caches. For example, in a mobile GPU with four shader cores, about 84.6% of the requested texture blocks are replicated in at least one of the other private Texture Caches.

This paper proposes a novel dynamically mapped Non-Uniform Cache Architecture (NUCA) organization for the private Texture Caches of a mobile GPU, aimed at increasing their effective overall capacity and decreasing the overall access latency by attacking data replication. A block that misses in a local Texture Cache may be serviced by a remote one at a lower cost than a round trip to the shared L2. The proposed Dynamic Texture Mapping-NUCA (DTM-NUCA) features a lightweight mapping table, called the Affinity Table, that is independent of the L2 cache size, unlike in a traditional NUCA organization. The best owner for a given set of blocks is dynamically determined and stored in the Affinity Table to maximize local accesses. The mechanism also allows a certain amount of replication to favor local accesses where appropriate; the resulting capacity loss is small and does not hurt performance. DTM-NUCA is presented in two flavors: one with a centralized Affinity Table and another with a distributed Affinity Table.

Experimental results show, first, that L2 pressure is effectively reduced, with 41.8% of the L2 accesses eliminated on average. As for average latency, DTM-NUCA is very effective at favoring local over remote accesses, achieving 73.8% local accesses on average. As a consequence, our novel DTM-NUCA organization obtains an average speedup of 16.9% and overall energy savings of 7.6% over a conventional organization.
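
The access path outlined in the abstract (local Texture Cache first, then a remote owner recorded in the Affinity Table, then the shared L2) can be illustrated with a small Python sketch. This is not the authors' implementation. The abstract does not say how the "best owner" is chosen or how blocks are grouped, so everything here is an assumption made for illustration: the names `AffinityTable`, `record_access`, and `service_request`, the `blocks_per_group` parameter, the "most frequent accessor becomes owner" policy, and the modeling of each Texture Cache as a plain set of block addresses.

```python
from collections import Counter


class AffinityTable:
    """Hypothetical table mapping a group of texture blocks to an owner core."""

    def __init__(self, blocks_per_group=16):
        # Assumed grouping: consecutive block addresses share one table entry.
        self.blocks_per_group = blocks_per_group
        self.owner = {}   # group id -> core currently recorded as owner
        self.counts = {}  # group id -> per-core access counts

    def group_of(self, block_addr):
        return block_addr // self.blocks_per_group

    def record_access(self, core, block_addr):
        """Count an access and update the owner under the assumed policy."""
        group = self.group_of(block_addr)
        counter = self.counts.setdefault(group, Counter())
        counter[core] += 1
        current = self.owner.get(group)
        # Assumed policy: a core takes ownership only when it has strictly
        # more recorded accesses to the group than the current owner.
        if current is None or counter[core] > counter[current]:
            self.owner[group] = core
        return self.owner[group]


def service_request(core, block_addr, table, caches):
    """Classify where a request is served: 'local', 'remote', or 'l2'."""
    if block_addr in caches[core]:
        return "local"
    owner = table.owner.get(table.group_of(block_addr))
    # A local miss may be served by the remote owner instead of the shared L2.
    if owner is not None and owner != core and block_addr in caches[owner]:
        return "remote"
    return "l2"
```

In this sketch, a request from a core that is not the owner of a block group is classified as `"remote"` when the owner's cache holds the block, and as `"l2"` otherwise. A real design would also have to decide when to allow replication and how to distribute the table, both of which the abstract mentions but does not detail.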

