Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems

Rachata Ausavarungnirun, K. Chang, Lavanya Subramanian, G. Loh, O. Mutlu
{"title":"Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems","authors":"Rachata Ausavarungnirun, K. Chang, Lavanya Subramanian, G. Loh, O. Mutlu","doi":"10.1145/2366231.2337207","DOIUrl":null,"url":null,"abstract":"When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage consists of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"244","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2366231.2337207","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 244

Abstract

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage consists of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.
阶段内存调度:在异构系统中实现高性能和可伸缩性
当多个CPU内核和GPU集成在一块芯片上,共用片外主存时,GPU的请求会严重干扰CPU内核的请求,导致系统性能下降和CPU内核耗尽。不幸的是,由于大量GPU流量,最先进的应用程序感知内存调度算法无法在低复杂度下解决此问题。为了在全局请求流中为这些算法提供足够的可见性,需要一个大而昂贵的请求缓冲区,这需要相对复杂的硬件实现。本文提出了一种全新的方法,将内存控制器的三个主要任务解耦到三个明显更简单的结构中,这些结构共同提高了系统性能和公平性,特别是在集成的CPU-GPU系统中。我们的三阶段内存控制器首先基于行缓冲局部性对请求进行分组。这种分组允许第二阶段只关注应用程序间请求调度。这两个阶段执行有关性能和公平性的高级策略,因此最后一个阶段由简单的每个银行FIFO队列(每个银行内不再有命令重新排序)和仅处理低级DRAM命令和定时的直接逻辑组成。我们评估了我们的分阶段内存调度器(SMS)所涉及的设计权衡,并将其与三种最先进的内存控制器设计进行比较。我们的评估表明,SMS提高了CPU性能,而不会使GPU帧率降低到超出一般可接受的水平,同时实现起来比以前的应用程序感知调度器要简单得多。此外,SMS可以由系统软件配置,以在不同级别上优先考虑CPU或GPU,以满足不同的性能需求。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信