LTRF:通过硬件/软件协同寄存器预取为gpu启用高容量寄存器文件

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2018-03-19 DOI:10.1145/3173162.3173211

Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, H. Sarbazi-Azad, M. Drumond, B. Falsafi, Rachata Ausavarungnirun, O. Mutlu

{"title":"LTRF:通过硬件/软件协同寄存器预取为gpu启用高容量寄存器文件","authors":"Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, H. Sarbazi-Azad, M. Drumond, B. Falsafi, Rachata Ausavarungnirun, O. Mutlu","doi":"10.1145/3173162.3173211","DOIUrl":null,"url":null,"abstract":"Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes hierarchical register file, to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.","PeriodicalId":302876,"journal":{"name":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"40","resultStr":"{\"title\":\"LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching\",\"authors\":\"Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, H. Sarbazi-Azad, M. Drumond, B. Falsafi, Rachata Ausavarungnirun, O. Mutlu\",\"doi\":\"10.1145/3173162.3173211\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes hierarchical register file, to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.\",\"PeriodicalId\":302876,\"journal\":{\"name\":\"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems\",\"volume\":\"36 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"40\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3173162.3173211\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3173162.3173211","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 40

摘要

图形处理单元(gpu)使用大型寄存器文件来容纳所有活动线程并加速上下文切换。不幸的是，由于长访问延迟、高功耗和大芯片面积配置，寄存器文件是未来gpu的可伸缩性瓶颈。先前的工作提出了分层寄存器文件，通过在较小的寄存器文件缓存中缓存寄存器来减少寄存器文件的功耗。不幸的是，由于寄存器文件缓存中的低命中率，这种方法不能改善寄存器访问延迟。在本文中，我们提出了延迟容忍寄存器文件(LTRF)架构，以实现低延迟的两级分层结构，同时保持低功耗。我们观察到编译时间间隔分析使我们能够将GPU程序执行划分为间隔，并在每个间隔内准确估计warp的总寄存器工作集。LTRF的关键思想是在软件控制下，在每个间隔的开始，从主寄存器文件预取估计的寄存器工作集到寄存器文件缓存，并将预取延迟与其他warps的执行重叠。我们的实验结果表明，ltf支持高容量但长延迟的主GPU寄存器文件，为各种优化铺平了道路。作为优化示例，我们使用新兴的高密度高延迟内存技术实现主寄存器文件，使容量增加8倍，并将整体GPU性能提高31%，同时将寄存器文件功耗降低46%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes hierarchical register file, to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

自引率

0.00%

发文量