Xiangrong Xu, Liang Wang, Limin Xiao, Lei Liu, Zihao Zhou, Yuanqiu Lv, Li Ruan, Xilong Xie, Meng Han, Xiaojian Liao
Journal of Systems Architecture, Volume 167, Article 103500. Published 2025-06-19. DOI: 10.1016/j.sysarc.2025.103500
Impact factor: 3.7 (JCR Q1, Computer Science, Hardware & Architecture)
https://www.sciencedirect.com/science/article/pii/S1383762125001729
Exploiting intra-chip locality for multi-chip GPUs via two-level shared L1 cache
Remote memory accesses in multi-chip GPUs pose a major performance bottleneck due to high latency and inter-chip bandwidth contention. Exploiting intra-chip locality alleviates this bottleneck by serving memory accesses locally and reducing cross-chip traffic. Yet, conventional coarse-grained approaches to exploiting locality in multi-chip GPUs often incur excessive overhead, limiting their potential performance benefits. To this end, we propose TLS-Cache, a two-level shared L1 cache that efficiently exploits intra-chip locality without additional cache capacity. It mitigates high-latency remote memory accesses by enabling fine-grained data reuse through cluster-shared and remote-shared L1 caches, which capture locality within and across streaming multiprocessor clusters, respectively. These two caches work cooperatively to maximize the exploitation of intra-chip locality and deliver measurable performance gains. Experimental results show that TLS-Cache improves instructions per cycle by 30.2% on average, compared with the baseline 4-chip GPU with private L1 caches.
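The intuition behind sharing L1 capacity within a cluster can be illustrated with a toy model. The sketch below is not the paper's TLS-Cache design: it assumes a fully associative LRU cache, a single cluster of four streaming multiprocessors (SMs), and a synthetic trace in which every SM reads the same eight lines. The names `LRUCache` and `count_misses` are hypothetical, and the remote-shared level is not modeled. It only shows that pooling the same total capacity lets one SM's fill serve its neighbours, so fewer requests would leave the chip.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache that tracks line addresses only."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and evict the LRU one."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[addr] = None
        return False


def count_misses(trace, num_sms, lines_per_sm, shared):
    """Count L1 misses for a trace of (sm_id, line_addr) pairs.

    shared=False: each SM has a private L1 of lines_per_sm lines.
    shared=True: the SMs pool the same total capacity into one logical L1
    (an illustrative assumption, not the mechanism described in the paper).
    """
    if shared:
        pool = LRUCache(num_sms * lines_per_sm)
        caches = [pool] * num_sms
    else:
        caches = [LRUCache(lines_per_sm) for _ in range(num_sms)]
    return sum(0 if caches[sm].access(addr) else 1 for sm, addr in trace)


# Synthetic trace: four SMs in one cluster each read the same 8 lines.
trace = [(sm, addr) for sm in range(4) for addr in range(8)]
private_misses = count_misses(trace, num_sms=4, lines_per_sm=8, shared=False)
shared_misses = count_misses(trace, num_sms=4, lines_per_sm=8, shared=True)
print(private_misses, shared_misses)  # prints "32 8"
```

In this trace the private configuration misses once per SM per line, while the pooled configuration misses only on the first touch of each line. Real workloads, set associativity, and interconnect latency would change the numbers; the 30.2% figure above comes from the authors' experiments, not from this model.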
About the journal:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques and tools for their design, as well as novel designs of software components fall within the scope of this journal. Novel applications that use embedded systems are also central in this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components with an emphasis on software are also relevant here.