Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems
Rachata Ausavarungnirun, K. Chang, Lavanya Subramanian, G. Loh, O. Mutlu
2012 39th Annual International Symposium on Computer Architecture (ISCA), June 2012. DOI: 10.1145/2366231.2337207
When multiple CPU cores and a GPU integrated on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce the high-level policies regarding performance and fairness, so the last stage can consist of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.
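To make the staged organization concrete, the following is a minimal Python sketch of how the three stages described in the abstract could fit together: batch formation by row-buffer locality, inter-application batch scheduling with a configurable CPU/GPU priority, and simple per-bank FIFOs. All class and parameter names (Request, BatchFormation, cpu_priority, and so on) are illustrative assumptions rather than the paper's hardware design, and DRAM command/timing details are omitted.

```python
# Illustrative sketch of a three-stage (SMS-like) memory scheduler.
# Not the paper's hardware implementation; names and policies are simplified.
import random
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    source_id: int   # which CPU core or GPU issued the request
    is_gpu: bool
    bank: int
    row: int

class BatchFormation:
    """Stage 1: group each source's consecutive same-row requests into a batch."""
    def __init__(self):
        self.open_batches = {}   # source_id -> requests to the currently open row

    def insert(self, req, ready_batches):
        batch = self.open_batches.setdefault(req.source_id, [])
        if batch and batch[-1].row != req.row:
            ready_batches.append(batch)                 # row changed: close the batch
            batch = self.open_batches[req.source_id] = []
        batch.append(req)

class BatchScheduler:
    """Stage 2: pick the next batch; cpu_priority is a hypothetical knob that
    probabilistically favors CPU batches, reflecting the configurable
    CPU-vs-GPU prioritization the abstract mentions."""
    def __init__(self, cpu_priority=0.9):
        self.cpu_priority = cpu_priority

    def pick(self, ready_batches, rng):
        cpu_batches = [b for b in ready_batches if not b[0].is_gpu]
        pool = cpu_batches if cpu_batches and rng.random() < self.cpu_priority else ready_batches
        batch = pool[0]                                  # a real policy could refine this choice
        ready_batches.remove(batch)
        return batch

class DramCommandStage:
    """Stage 3: simple per-bank FIFOs; requests issue in order within each bank."""
    def __init__(self, num_banks):
        self.bank_fifos = [deque() for _ in range(num_banks)]

    def enqueue(self, batch):
        for req in batch:
            self.bank_fifos[req.bank].append(req)

    def issue(self):
        for fifo in self.bank_fifos:
            if fifo:
                yield fifo.popleft()                     # low-level DRAM timing omitted

# Toy usage: form batches, schedule one, and drain the per-bank FIFOs.
if __name__ == "__main__":
    rng = random.Random(0)
    ready, stage1, stage2, stage3 = [], BatchFormation(), BatchScheduler(), DramCommandStage(num_banks=4)
    for req in [Request(0, False, 0, 10), Request(0, False, 0, 10), Request(1, True, 1, 20),
                Request(0, False, 0, 11), Request(1, True, 1, 21)]:
        stage1.insert(req, ready)
    if ready:
        stage3.enqueue(stage2.pick(ready, rng))
        print(list(stage3.issue()))
```

The point of the sketch is the decoupling: stage 1 captures row-buffer locality per source, stage 2 applies the performance/fairness policy between applications, and stage 3 stays a FIFO that never reorders requests within a bank.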