Evaluating and mitigating bandwidth bottlenecks across the memory hierarchy in GPUs

Saumay Dublish, V. Nagarajan, N. Topham
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Publication date: 2017-07-13
DOI: 10.1109/ISPASS.2017.7975295 (https://doi.org/10.1109/ISPASS.2017.7975295)
Citations: 9

Abstract

GPUs are often limited by off-chip memory bandwidth. With the advent of general-purpose computing on GPUs, a cache hierarchy has been introduced to filter the bandwidth demand to the off-chip memory. However, the cache hierarchy presents its own bandwidth limitations in sustaining such high levels of memory traffic. In this paper, we characterize the bandwidth bottlenecks present across the memory hierarchy in GPUs for general-purpose applications. We quantify the stalls throughout the memory hierarchy and identify the architectural parameters that play a pivotal role in congesting the memory system. We explore the architectural design space to mitigate the bandwidth bottlenecks and show that the performance improvement achieved by mitigating the bandwidth bottleneck in the cache hierarchy can exceed the speedup obtained by a memory system with a baseline cache hierarchy and High Bandwidth Memory (HBM) DRAM. We also show that addressing the bandwidth bottleneck in isolation at specific levels can be sub-optimal and even counter-productive. It is therefore imperative to resolve the bandwidth bottlenecks synergistically across different levels of the memory hierarchy. With the insights developed in this paper, we perform a cost-benefit analysis and identify cost-effective configurations of the memory hierarchy that effectively mitigate the bandwidth bottlenecks. Our final configuration achieves a performance improvement of 29% on average with a minimal area overhead of 1.6%.
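The abstract's point that fixing one level in isolation can be sub-optimal can be illustrated with a toy throughput model. The sketch below is not the paper's methodology: the function name `sustained_request_rate`, the three-level hierarchy, and every bandwidth and traffic number are hypothetical, chosen only to show how the bottleneck moves between levels. It assumes each level serves a fixed number of bytes per cycle and sees only a fraction of the traffic issued by the cores.

```python
# Toy model with made-up numbers; it does not reproduce the paper's results.

def sustained_request_rate(levels):
    """Return (rate, bottleneck) for a list of (name, bandwidth, traffic_fraction).

    bandwidth        -- bytes/cycle the level can serve (hypothetical)
    traffic_fraction -- fraction of core-issued bytes that reach this level
    A level can sustain a core-side rate of bandwidth / traffic_fraction;
    the smallest such limit bounds the whole hierarchy.
    """
    limits = [(bw / frac, name) for name, bw, frac in levels if frac > 0]
    return min(limits)


def with_bandwidth(levels, name, new_bw):
    """Copy of `levels` with one level's bandwidth replaced."""
    return [(n, new_bw if n == name else bw, f) for n, bw, f in levels]


# Hypothetical baseline: half the traffic misses in L1, a quarter reaches DRAM.
baseline = [("L1", 128, 1.0), ("L2", 32, 0.5), ("DRAM", 24, 0.25)]

faster_dram_only = with_bandwidth(baseline, "DRAM", 48)   # HBM-like upgrade alone
faster_l2_only = with_bandwidth(baseline, "L2", 64)       # cache upgrade alone
both = with_bandwidth(faster_l2_only, "DRAM", 48)         # upgrades combined

for label, cfg in [("baseline", baseline),
                   ("faster DRAM only", faster_dram_only),
                   ("faster L2 only", faster_l2_only),
                   ("both", both)]:
    rate, bottleneck = sustained_request_rate(cfg)
    print(f"{label:18s} rate={rate:6.1f} B/cycle  bottleneck={bottleneck}")
```

With these assumed numbers, doubling DRAM bandwidth alone leaves the rate unchanged because L2 remains the bottleneck, doubling L2 bandwidth alone shifts the bottleneck to DRAM, and only the combined change lifts both limits. The model is too simple to capture the counter-productive cases the abstract mentions, which depend on congestion effects it does not represent.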