Exploiting intra-chip locality for multi-chip GPUs via two-level shared L1 cache

IF 4.1 | Zone 2 (Computer Science) | JCR Q1: Computer Science, Hardware & Architecture

Journal of Systems Architecture | Pub Date: 2025-06-19 | DOI: 10.1016/j.sysarc.2025.103500

Xiangrong Xu, Liang Wang, Limin Xiao, Lei Liu, Zihao Zhou, Yuanqiu Lv, Li Ruan, Xilong Xie, Meng Han, Xiaojian Liao

Volume 167, Article 103500

https://www.sciencedirect.com/science/article/pii/S1383762125001729

Citations: 0

Abstract

Remote memory accesses in multi-chip GPUs pose a major performance bottleneck due to high latency and inter-chip bandwidth contention. Exploiting intra-chip locality alleviates this bottleneck by serving memory accesses locally and reducing cross-chip traffic. Yet, conventional coarse-grained approaches to exploiting locality in multi-chip GPUs often incur excessive overhead, limiting their potential performance benefits. To this end, we propose TLS-Cache, a two-level shared L1 cache that efficiently exploits intra-chip locality without additional cache capacity. It mitigates high-latency remote memory accesses by enabling fine-grained data reuse through cluster-shared and remote-shared L1 caches, which capture locality within and across streaming multiprocessor clusters, respectively. These two caches work cooperatively to maximize the exploitation of intra-chip locality and deliver measurable performance gains. Experimental results show that TLS-Cache improves instructions per cycle by 30.2% on average, compared with the baseline 4-chip GPU with private L1 caches.
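The abstract gives no implementation details, so the following is only a toy sketch of the two-level idea it describes: a remote line is looked up first in a cluster-shared L1 (reuse within one streaming multiprocessor cluster), then in a remote-shared L1 (reuse across clusters on the same chip), and only then fetched across chips. The class `Chip`, the method `access_remote_line`, the fill policy, and all latency values are hypothetical and are not taken from the paper. Capacity limits, replacement, and coherence are not modeled.

```python
from dataclasses import dataclass, field

# Hypothetical latencies in cycles; illustrative only, not from the paper.
LAT_CLUSTER_L1 = 30
LAT_REMOTE_L1 = 60
LAT_REMOTE_MEM = 400


@dataclass
class Chip:
    """Toy model of one GPU chip made of SM clusters (assumed organization)."""
    num_clusters: int
    # One cluster-shared L1 per SM cluster, modeled as a set of line addresses.
    cluster_l1: list = field(default_factory=list)
    # One remote-shared L1 per chip, holding lines that live on other chips.
    remote_l1: set = field(default_factory=set)

    def __post_init__(self):
        self.cluster_l1 = [set() for _ in range(self.num_clusters)]

    def access_remote_line(self, cluster: int, line: int) -> tuple[str, int]:
        """Return (level that served the access, latency) for a remote line."""
        if line in self.cluster_l1[cluster]:
            # Reuse within the requesting cluster.
            return "cluster-shared L1", LAT_CLUSTER_L1
        if line in self.remote_l1:
            # Reuse across clusters: another cluster on this chip fetched it.
            self.cluster_l1[cluster].add(line)
            return "remote-shared L1", LAT_REMOTE_L1
        # Miss in both levels: pay the cross-chip access, then fill both
        # levels (an assumed fill policy).
        self.remote_l1.add(line)
        self.cluster_l1[cluster].add(line)
        return "remote memory", LAT_REMOTE_MEM
```

In this sketch, the first access to a remote line from cluster 0 goes across chips, a repeat from cluster 0 is served by its cluster-shared L1, and a first access from cluster 1 is served by the remote-shared L1 instead of crossing chips again.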


Source journal

Journal of Systems Architecture (Engineering & Technology: Computer Science, Hardware)

CiteScore: 8.70
Self-citation rate: 15.60%
Articles published: 226
Review time: 46 days

About the journal: The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures, as well as additional subjects in the computer and system architecture area, fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software. Design automation of such systems, including methodologies, techniques and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central in this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components with an emphasis on software are also relevant here.