{"title":"高效利用GPGPU缓存层次结构","authors":"Mahmoud Khairy, M. Zahran, A. Wassal","doi":"10.1145/2716282.2716291","DOIUrl":null,"url":null,"abstract":"Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to the massive multithreading, GPGPU caches suffer from severe resource contention and low data-sharing which may degrade the performance instead. In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches. The first technique aims to dynamically detect and bypass memory accesses that show streaming behavior. In the second technique, we propose dynamic warp throttling via cores sampling (DWT-CS) to alleviate cache thrashing by throttling the number of active warps per core. DWT-CS monitors the MPKI at L1, when it exceeds a specific threshold, all GPU cores are sampled with different number of active warps to find the optimal number of warps that mitigates thrashing and achieves the highest performance. Our proposed third technique addresses the problem of GPU cache associativity since many GPGPU applications suffer from severe associativity stalls and conflict misses. Prior work proposed cache bypassing on associativity stalls. In this work, instead of bypassing, we employ a better cache indexing function, Pseudo Random Interleaving Cache (PRIC), that is based on polynomial modulus mapping, in order to fairly and evenly distribute memory accesses over cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X respectively. Compared to prior work, it achieves 1.7X and 1.5X performance improvement over Cache-Conscious Wavefront Scheduler and Memory Request Prioritization Buffer respectively.","PeriodicalId":432610,"journal":{"name":"Proceedings of the 8th Workshop on General Purpose Processing using GPUs","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":"{\"title\":\"Efficient utilization of GPGPU cache hierarchy\",\"authors\":\"Mahmoud Khairy, M. Zahran, A. Wassal\",\"doi\":\"10.1145/2716282.2716291\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to the massive multithreading, GPGPU caches suffer from severe resource contention and low data-sharing which may degrade the performance instead. In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches. The first technique aims to dynamically detect and bypass memory accesses that show streaming behavior. In the second technique, we propose dynamic warp throttling via cores sampling (DWT-CS) to alleviate cache thrashing by throttling the number of active warps per core. DWT-CS monitors the MPKI at L1, when it exceeds a specific threshold, all GPU cores are sampled with different number of active warps to find the optimal number of warps that mitigates thrashing and achieves the highest performance. Our proposed third technique addresses the problem of GPU cache associativity since many GPGPU applications suffer from severe associativity stalls and conflict misses. 
Prior work proposed cache bypassing on associativity stalls. In this work, instead of bypassing, we employ a better cache indexing function, Pseudo Random Interleaving Cache (PRIC), that is based on polynomial modulus mapping, in order to fairly and evenly distribute memory accesses over cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X respectively. Compared to prior work, it achieves 1.7X and 1.5X performance improvement over Cache-Conscious Wavefront Scheduler and Memory Request Prioritization Buffer respectively.\",\"PeriodicalId\":432610,\"journal\":{\"name\":\"Proceedings of the 8th Workshop on General Purpose Processing using GPUs\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-02-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"30\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 8th Workshop on General Purpose Processing using GPUs\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2716282.2716291\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th Workshop on General Purpose Processing using GPUs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2716282.2716291","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 30
Abstract
Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to massive multithreading, GPGPU caches suffer from severe resource contention and low data sharing, which can degrade performance instead. In this work, we propose three techniques to utilize GPGPU caches efficiently and improve their performance. The first technique dynamically detects and bypasses memory accesses that exhibit streaming behavior. The second technique, dynamic warp throttling via cores sampling (DWT-CS), alleviates cache thrashing by throttling the number of active warps per core. DWT-CS monitors the L1 misses per kilo-instruction (MPKI); when it exceeds a threshold, each GPU core is sampled with a different number of active warps to find the warp count that mitigates thrashing and achieves the highest performance. Our third technique addresses GPU cache associativity, since many GPGPU applications suffer from severe associativity stalls and conflict misses. Prior work proposed cache bypassing on associativity stalls; instead of bypassing, we employ a better cache-indexing function, Pseudo Random Interleaving Cache (PRIC), which is based on polynomial modulus mapping and distributes memory accesses fairly and evenly across cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X, respectively. Compared to prior work, they achieve 1.7X and 1.5X performance improvements over the Cache-Conscious Wavefront Scheduler and the Memory Request Prioritization Buffer, respectively.
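As a rough illustration of the first technique, the sketch below shows one way a dynamic streaming detector could work: a per-PC reuse table classifies loads whose cache lines are rarely re-referenced before eviction as streaming and routes them around the L1. The table structure, warm-up count, and reuse threshold are assumptions made for this sketch, not details from the paper.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative sketch of dynamic streaming detection and L1 bypass.
// The per-PC reuse table and its thresholds are assumed for illustration;
// the paper's detector may differ in detail.

struct ReuseStat { uint32_t fills = 0, reuses = 0; };

class StreamingDetector {
    std::unordered_map<uint64_t, ReuseStat> table_;  // keyed by load PC
public:
    void on_fill(uint64_t pc)  { table_[pc].fills++; }   // line allocated in L1
    void on_reuse(uint64_t pc) { table_[pc].reuses++; }  // line hit before eviction

    // A PC whose lines are almost never re-referenced before eviction is
    // treated as streaming, and its accesses bypass the L1.
    bool should_bypass(uint64_t pc) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.fills < 32) return false;  // warm-up
        return it->second.reuses * 8 < it->second.fills;  // reuse rate below 12.5%
    }
};
```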
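The second technique can be sketched as a sampling loop over the cores. The GpuCore interface, the MPKI threshold, and the way candidate warp counts are spread across cores below are illustrative assumptions; only the overall trigger-sample-adopt structure follows the abstract.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <algorithm>

// Sketch of an MPKI-triggered warp-throttling loop in the spirit of DWT-CS.
// All interfaces and constants are assumptions made for illustration.

struct GpuCore {
    uint64_t l1_misses = 0;      // L1 data-cache misses this epoch
    uint64_t insts = 0;          // instructions retired this epoch
    double epoch_ipc = 0.0;      // IPC measured over the sampling epoch
    unsigned active_warps = 48;  // assumed hardware maximum per core
};

constexpr double   kMpkiThreshold = 30.0;  // assumed thrashing threshold
constexpr unsigned kMaxWarps = 48;

// Aggregate L1 misses per kilo-instruction across all cores.
double l1_mpki(const std::vector<GpuCore>& cores) {
    uint64_t misses = 0, insts = 0;
    for (const auto& c : cores) { misses += c.l1_misses; insts += c.insts; }
    return insts ? 1000.0 * static_cast<double>(misses) / insts : 0.0;
}

// When thrashing is detected, give each core a different active-warp count
// for one epoch (core sampling), then adopt the count of the
// best-performing core everywhere.
void dwt_cs_sample(std::vector<GpuCore>& cores) {
    if (cores.size() < 2 || l1_mpki(cores) < kMpkiThreshold) return;
    for (size_t i = 0; i < cores.size(); ++i)
        cores[i].active_warps =
            1 + static_cast<unsigned>(i * (kMaxWarps - 1) / (cores.size() - 1));
    // ... run one sampling epoch here, filling in epoch_ipc per core ...
    auto best = std::max_element(cores.begin(), cores.end(),
        [](const GpuCore& a, const GpuCore& b) { return a.epoch_ipc < b.epoch_ipc; });
    for (auto& c : cores) c.active_warps = best->active_warps;
}
```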
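For the third technique, polynomial modulus mapping treats the line address as a polynomial over GF(2) and uses its remainder modulo an irreducible polynomial as the set index, so power-of-two strides no longer collapse onto a single set. The cache geometry and the particular polynomial below are assumed for illustration, not the paper's parameters.

```cpp
#include <cstdint>
#include <cstdio>

// Minimal sketch of polynomial-modulus set indexing in the spirit of PRIC.
// Line-address bits are reduced modulo an irreducible polynomial of degree
// log2(num_sets); the remainder is the set index.

constexpr unsigned kSetBits  = 5;         // 32 sets (assumed geometry)
constexpr unsigned kLineBits = 7;         // 128-byte cache lines (assumed)
constexpr uint32_t kPoly     = 0b100101;  // x^5 + x^2 + 1, irreducible over GF(2)

uint32_t pric_set_index(uint64_t addr) {
    uint64_t line = addr >> kLineBits;    // drop the line-offset bits
    // Long division over GF(2): XOR out the divisor aligned to each set
    // high bit until only the degree-<kSetBits remainder is left.
    for (int bit = 63; bit >= static_cast<int>(kSetBits); --bit) {
        if (line & (1ULL << bit))
            line ^= static_cast<uint64_t>(kPoly) << (bit - kSetBits);
    }
    return static_cast<uint32_t>(line);
}

int main() {
    // Under simple bit-selection indexing, a 4 KB stride maps every access
    // to set 0; polynomial mapping spreads them across the sets.
    for (uint64_t i = 0; i < 8; ++i)
        printf("addr %#llx -> set %u\n",
               static_cast<unsigned long long>(i << 12), pric_set_index(i << 12));
}
```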