GPU Performance Optimization via Intergroup Cache Cooperation

IF 2.7 3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems Pub Date : 2024-11-06 DOI:10.1109/TCAD.2024.3443707

Guosheng Wang;Yajuan Du;Weiming Huang

{"title":"GPU Performance Optimization via Intergroup Cache Cooperation","authors":"Guosheng Wang;Yajuan Du;Weiming Huang","doi":"10.1109/TCAD.2024.3443707","DOIUrl":null,"url":null,"abstract":"Modern GPUs have integrated multilevel cache hierarchy to provide high bandwidth and mitigate the memory wall problem. However, the benefit of on-chip cache is far from achieving optimal performance. In this article, we investigate existing cache architecture and find that the cache utilization is imbalanced and there exists serious data duplication among L1 cache groups.In order to exploit the duplicate data, we propose an intergroup cache cooperation (ICC) method to establish the cooperation across L1 cache groups. According the cooperation scope, we design two schemes of the adjacent cache cooperation (ICC-AGC) and the multiple cache cooperation (ICC-MGC). In ICC-AGC, we design an adjacent cooperative directory table to realize the perception of duplicate data and integrate a lightweight network for communication. In ICC-MGC, a ring bi-directional network is designed to realize the connection among multiple groups. And we present a two-way sending mechanism and a dynamic sending mechanism to balance the overhead and efficiency involved in request probing and sending.Evaluation results show that the proposed two ICC methods can reduce the average traffic to L2 cache by 10% and 20%, respectively, and improve overall GPU performance by 19% and 49% on average, respectively, compared with the existing work.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4142-4153"},"PeriodicalIF":2.7000,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10745842/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

Abstract

Modern GPUs have integrated multilevel cache hierarchy to provide high bandwidth and mitigate the memory wall problem. However, the benefit of on-chip cache is far from achieving optimal performance. In this article, we investigate existing cache architecture and find that the cache utilization is imbalanced and there exists serious data duplication among L1 cache groups.In order to exploit the duplicate data, we propose an intergroup cache cooperation (ICC) method to establish the cooperation across L1 cache groups. According the cooperation scope, we design two schemes of the adjacent cache cooperation (ICC-AGC) and the multiple cache cooperation (ICC-MGC). In ICC-AGC, we design an adjacent cooperative directory table to realize the perception of duplicate data and integrate a lightweight network for communication. In ICC-MGC, a ring bi-directional network is designed to realize the connection among multiple groups. And we present a two-way sending mechanism and a dynamic sending mechanism to balance the overhead and efficiency involved in request probing and sending.Evaluation results show that the proposed two ICC methods can reduce the average traffic to L2 cache by 10% and 20%, respectively, and improve overall GPU performance by 19% and 49% on average, respectively, compared with the existing work.

查看原文本刊更多论文

通过组间缓存合作优化 GPU 性能

现代 GPU 集成了多级缓存层次结构，以提供高带宽并缓解内存墙问题。然而，片上缓存的优势远未达到最佳性能。为了利用重复数据，我们提出了一种组间缓存合作（ICC）方法，以建立 L1 缓存组间的合作。根据合作范围，我们设计了相邻缓存合作（ICC-AGC）和多缓存合作（ICC-MGC）两种方案。在 ICC-AGC 中，我们设计了一个相邻合作目录表来实现对重复数据的感知，并集成了一个轻量级网络用于通信。在 ICC-MGC 中，我们设计了一个环形双向网络来实现多个组之间的连接。评估结果表明，与现有工作相比，所提出的两种 ICC 方法可分别减少 10% 和 20% 的二级缓存平均流量，并可分别平均提高 19% 和 49% 的 GPU 整体性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 工程技术-工程：电子与电气

CiteScore

5.60

自引率

13.80%

发文量

500

审稿时长

7 months

期刊介绍： The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.