QuickRelease: A throughput-oriented approach to release consistency on GPUs

2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2014-06-19. DOI: 10.1109/HPCA.2014.6835930

Blake A. Hechtman, Shuai Che, Derek Hower, Yingying Tian, Bradford M. Beckmann, M. Hill, S. Reinhardt, D. Wood


Citations: 63

Abstract

Graphics processing units (GPUs) have specialized throughput-oriented memory systems optimized for streaming writes with scratchpad memories to capture locality explicitly. Expanding the utility of GPUs beyond graphics encourages designs that simplify programming (e.g., using caches instead of scratchpads) and better support irregular applications with finer-grain synchronization. Our hypothesis is that, like CPUs, GPUs will benefit from caches and coherence, but that CPU-style “read for ownership” (RFO) coherence is inappropriate to maintain support for regular streaming workloads. This paper proposes QuickRelease (QR), which improves on conventional GPU memory systems in two ways. First, QR uses a FIFO to enforce the partial order of writes so that synchronization operations can complete without frequent cache flushes. Thus, non-synchronizing threads in QR can re-use cached data even when other threads are performing synchronization. Second, QR partitions the resources required by reads and writes to reduce the penalty of writes on read performance. Simulation results across a wide variety of general-purpose GPU workloads show that QR achieves a 7% average performance improvement compared to a conventional GPU memory system. Furthermore, for emerging workloads with finer-grain synchronization, QR achieves up to 42% performance improvement compared to a conventional GPU memory system without the scalability challenges of RFO coherence. To this end, QR provides a throughput-oriented solution to provide fine-grain synchronization on GPUs.
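The FIFO idea in the abstract can be illustrated with a small software sketch. This is a toy model under stated assumptions, not the paper's hardware design: the class `ToyWriteFifo`, the `RELEASE` marker, and all method names are hypothetical, and the model ignores store-to-load forwarding, multiple cache levels, and invalidation. It shows only two points from the abstract: a release can complete once every earlier write has drained in order, with no flush of cached read data, and reads use storage separate from the write path.

```python
from collections import deque

# Hypothetical marker object standing in for a release (synchronization) operation.
RELEASE = object()


class ToyWriteFifo:
    """Toy model: stores and release markers drain to memory in FIFO order."""

    def __init__(self, memory):
        self.memory = memory            # dict standing in for global memory
        self.fifo = deque()             # pending (addr, value) stores and RELEASE markers
        self.read_cache = {}            # read-side storage, kept apart from the write path
        self.completed_releases = 0     # number of release markers that reached the head

    def load(self, addr):
        # Serve reads from the read cache; fill it from memory on a miss.
        # Assumption: no forwarding from pending stores (omitted for brevity).
        if addr not in self.read_cache:
            self.read_cache[addr] = self.memory.get(addr, 0)
        return self.read_cache[addr]

    def store(self, addr, value):
        # Stores are only queued here; they reach memory when drained.
        self.fifo.append((addr, value))

    def release(self):
        # A release is ordered after every store already in the FIFO.
        self.fifo.append(RELEASE)

    def drain_one(self):
        # Retire the oldest entry. Returns False if nothing was pending.
        if not self.fifo:
            return False
        entry = self.fifo.popleft()
        if entry is RELEASE:
            # All earlier stores have been written, so the release completes.
            # The read cache is left untouched: no flush is needed.
            self.completed_releases += 1
        else:
            addr, value = entry
            self.memory[addr] = value
        return True
```

In this sketch, a thread that stores data, calls `release()`, and then stores a flag is guaranteed that the data reaches memory before the release is counted as complete, while entries already in `read_cache` remain available to other readers throughout.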



