Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems

2012 39th Annual International Symposium on Computer Architecture (ISCA) | Pub Date: 2012-06-09 | DOI: 10.1145/2366231.2337207

Rachata Ausavarungnirun, K. Chang, Lavanya Subramanian, G. Loh, O. Mutlu


Citations: 244

Abstract

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage consists of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.
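The three-stage split described in the abstract can be illustrated with a minimal Python sketch. It is a toy model, not the authors' hardware design: the abstract does not say how batches are bounded or which policy the second stage uses, so the per-source batching rule (a batch closes when the next request from the same source targets a different bank or row) and the round-robin batch selection below are assumptions chosen for illustration. All names (`Request`, `StagedSchedulerSketch`, `enqueue`, `flush`, `schedule_batch`, `issue`) are hypothetical, and DRAM timing is not modeled.

```python
from collections import deque, namedtuple

# Hypothetical request record: issuing source (a CPU core or the GPU),
# target bank, and target row.
Request = namedtuple("Request", ["source", "bank", "row"])


class StagedSchedulerSketch:
    """Toy model of a three-stage memory scheduler (illustrative only)."""

    def __init__(self, num_banks):
        self.forming = {}         # source -> batch currently being formed
        self.ready = {}           # source -> deque of completed batches
        self.rr_order = deque()   # round-robin order of sources (assumed policy)
        self.bank_fifos = [deque() for _ in range(num_banks)]

    def enqueue(self, req):
        """Stage 1: group each source's requests by row-buffer locality."""
        if req.source not in self.forming:
            self.forming[req.source] = []
            self.ready[req.source] = deque()
            self.rr_order.append(req.source)
        batch = self.forming[req.source]
        # Assumption: a batch ends when the same source switches bank or row.
        if batch and (batch[-1].bank, batch[-1].row) != (req.bank, req.row):
            self.ready[req.source].append(batch)
            batch = []
            self.forming[req.source] = batch
        batch.append(req)

    def flush(self):
        """Close every batch still being formed (e.g. on a timeout)."""
        for source, batch in self.forming.items():
            if batch:
                self.ready[source].append(batch)
                self.forming[source] = []

    def schedule_batch(self):
        """Stage 2: choose between sources; here, plain round-robin."""
        for _ in range(len(self.rr_order)):
            source = self.rr_order[0]
            self.rr_order.rotate(-1)
            if self.ready[source]:
                for req in self.ready[source].popleft():
                    self.bank_fifos[req.bank].append(req)
                return source
        return None

    def issue(self, bank):
        """Stage 3: per-bank FIFO, with no reordering inside a bank."""
        queue = self.bank_fifos[bank]
        return queue.popleft() if queue else None
```

In this sketch, a caller feeds requests through `enqueue`, calls `schedule_batch` to move one whole batch into the bank queues, and drains each bank with `issue`. Because the second stage moves whole batches, requests to the same row stay adjacent in the bank FIFO, which is why the last stage can remain a plain queue.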


Source journal

2012 39th Annual International Symposium on Computer Architecture (ISCA)

Self-citation rate

0.00%

Publication count