A Simple Cache Coherence Scheme for Integrated CPU-GPU Systems

2020 57th ACM/IEEE Design Automation Conference (DAC). Pub Date: 2020-07-01. DOI: 10.1109/DAC18072.2020.9218664

Ardhi Wiratama Baskara Yudha, Reza Pulungan, H. Hoffmann, Yan Solihin


Citations: 1

Abstract

This paper presents a novel approach to accelerating applications that run on integrated CPU-GPU systems. Many integrated CPU-GPU systems communicate through cache-coherent shared memory. For example, after the CPU produces data for the GPU, the GPU may pull the data into its cache when it accesses that data. In such a pull-based approach, data resides in a shared cache until the GPU accesses it, so the first GPU access to each cache line incurs a long load latency. In this work, we propose a new, push-based coherence mechanism that explicitly exploits the producer-consumer relationship between CPU and GPU by automatically moving data from the CPU to the GPU last-level cache. The proposed mechanism dramatically reduces the GPU L2 cache miss rate in general and consequently increases overall performance. Our experiments show that the proposed scheme can increase performance by up to 37%, with typical improvements in the 5–7% range. We find that even when the tested applications do not benefit from the proposed approach, their performance does not decrease with our technique. While we demonstrate how the proposed scheme can co-exist with traditional cache coherence mechanisms, we argue that it could also be used as a simpler replacement for existing protocols.
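
The difference between pull-based and push-based movement of producer data can be illustrated with a toy model. The sketch below is not the protocol from the paper: the `LRUCache` class, the `gpu_miss_rate` function, the fully associative LRU policy, and the cache sizes are all assumptions made for illustration. It only counts GPU L2 misses when the CPU produces a set of cache lines and the GPU then reads them once.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative LRU cache that tracks line addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def contains(self, addr):
        # A hit refreshes the line's recency.
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def insert(self, addr):
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        # Evict the least recently used line when over capacity.
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


def gpu_miss_rate(produced, consumed, gpu_l2_lines, push):
    """Return the GPU L2 miss rate for one produce-then-consume phase.

    push=False models pull: lines enter the GPU L2 only on a GPU miss.
    push=True models push: each produced line is forwarded to the GPU L2
    as soon as the CPU produces it (an assumed, simplified policy).
    """
    gpu_l2 = LRUCache(gpu_l2_lines)
    if push:
        for addr in produced:
            gpu_l2.insert(addr)

    misses = 0
    for addr in consumed:
        if not gpu_l2.contains(addr):
            misses += 1
            gpu_l2.insert(addr)  # demand fetch from the shared cache
    return misses / len(consumed)


lines = list(range(64))  # 64 cache lines produced by the CPU
pull_rate = gpu_miss_rate(lines, lines, gpu_l2_lines=128, push=False)
push_rate = gpu_miss_rate(lines, lines, gpu_l2_lines=128, push=True)
print(f"pull miss rate: {pull_rate:.2f}, push miss rate: {push_rate:.2f}")
```

In this toy setting every first access misses under pull, while under push the lines are already resident when the GPU reads them, provided the produced data fits in the GPU L2. The model ignores latency, bandwidth, and coherence state entirely, so it says nothing about the reported speedups.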
