Energy-efficient mechanisms for managing thread context in throughput processors

2011 38th Annual International Symposium on Computer Architecture (ISCA) Pub Date : 2011-06-04 DOI:10.1145/2000064.2000093

Mark Gebhart, Daniel R. Johnson, D. Tarjan, S. Keckler, W. Dally, Erik Lindholm, K. Skadron

{"title":"Energy-efficient mechanisms for managing thread context in throughput processors","authors":"Mark Gebhart, Daniel R. Johnson, D. Tarjan, S. Keckler, W. Dally, Erik Lindholm, K. Skadron","doi":"10.1145/2000064.2000093","DOIUrl":null,"url":null,"abstract":"Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. Second, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Combined with register file caching, a two-level thread scheduler provides a further reduction in energy by limiting the allocation of temporary register cache resources to only the currently active subset of threads. We show that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively. We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"267","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2000064.2000093","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 267

Abstract

Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. Second, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Combined with register file caching, a two-level thread scheduler provides a further reduction in energy by limiting the allocation of temporary register cache resources to only the currently active subset of threads. We show that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively. We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.

查看原文本刊更多论文

在吞吐量处理器中管理线程上下文的节能机制

现代图形处理单元(gpu)使用大量的硬件线程来隐藏功能单元和内存访问延迟。极端多线程需要一个复杂的线程调度器和一个大的寄存器文件，这在能量和延迟方面都是昂贵的。我们提出了两种互补的技术来减少像gpu这样的大线程处理器上的能量。首先，我们检查寄存器文件缓存，将对大型主寄存器文件的访问替换为对包含活动线程的即时寄存器工作集的较小结构的访问。其次，我们研究了一个两级线程调度器，它维护一小组活动线程来隐藏ALU和本地内存访问延迟，以及一组较大的挂起线程来隐藏主内存延迟。与寄存器文件缓存相结合，两级线程调度器通过将临时寄存器缓存资源的分配限制为当前活动的线程子集，进一步降低了能耗。我们表明，平均而言，在各种现实世界的图形和计算工作负载中，每个线程6个条目的寄存器文件缓存分别将主寄存器文件的读和写次数减少了50%和59%。我们进一步表明，活动线程计数可以在对性能影响最小的情况下减少1 / 4，从而使寄存器文件能量减少36%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2011 38th Annual International Symposium on Computer Architecture (ISCA)

自引率

0.00%

发文量