Jingweijia Tan, Zhanyuntian Li, Weiren Wang, Jiashuo Wang, Kaige Yan, Xiaohui Wei
{"title":"ACOPT:用于MCM-GPU架构性能优化的自适应连续性感知地址转换","authors":"Jingweijia Tan , Zhanyuntian Li , Weiren Wang , Jiashuo Wang , Kaige Yan , Xiaohui Wei","doi":"10.1016/j.future.2025.108048","DOIUrl":null,"url":null,"abstract":"<div><div>Multi-Chiplet-Module (MCM) designs become a rising technique to boost the performance of GPUs in the post-Moore era. In MCM-GPUs, multiple chiplets are integrated within a single physical package via high-bandwidth and low-latency in-package interconnections to provide more computational and storage resources, compared with traditional monolithic GPUs. Furthermore, MCM-GPUs can adopt Unified Virtual Memory (UVM) for data management, enabling them to be programmed as logically single GPUs. In this paper, we investigate the address translation process and analyze its bottlenecks in UVM MCM-GPUs. Our profiling of a set of GPU workloads reveals that L2 TLB miss requests incur significant PCIe transfer delays, page table walkers waiting time, and page table walk latency, which are not easily hidden. In addition, the address translation process in UVM-enabled MCM-GPUs remains inefficient as they neglect spatial continuity in memory access patterns and rigid translation granularity. Based on these observations, we propose ACOPT, a continuity-aware address translation framework for MCM-GPUs. ACOPT employs a hardware-based design to adaptively capture continuous multi-granularity pages and store them into one page table entry (PTE), which releases multiple pending requests waiting for page table walkers in advance to reduce the number of page table walks. In this way, multiple pages are fetched into one L2 TLB entry to effectively extend the L2 TLB reach when L2 TLB miss happens. Our experimental results reveal that ACOPT is able to achieve 1.54<span><math><mo>×</mo></math></span> speedup, 78% reduction in number of page table walks, and 70% reduction of L1 TLB misses latency on average across a set of applications.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"175 ","pages":"Article 108048"},"PeriodicalIF":6.2000,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ACOPT: Adaptive continuity-aware address translation for performance optimization of MCM-GPU architectures\",\"authors\":\"Jingweijia Tan , Zhanyuntian Li , Weiren Wang , Jiashuo Wang , Kaige Yan , Xiaohui Wei\",\"doi\":\"10.1016/j.future.2025.108048\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multi-Chiplet-Module (MCM) designs become a rising technique to boost the performance of GPUs in the post-Moore era. In MCM-GPUs, multiple chiplets are integrated within a single physical package via high-bandwidth and low-latency in-package interconnections to provide more computational and storage resources, compared with traditional monolithic GPUs. Furthermore, MCM-GPUs can adopt Unified Virtual Memory (UVM) for data management, enabling them to be programmed as logically single GPUs. In this paper, we investigate the address translation process and analyze its bottlenecks in UVM MCM-GPUs. Our profiling of a set of GPU workloads reveals that L2 TLB miss requests incur significant PCIe transfer delays, page table walkers waiting time, and page table walk latency, which are not easily hidden. 
In addition, the address translation process in UVM-enabled MCM-GPUs remains inefficient as they neglect spatial continuity in memory access patterns and rigid translation granularity. Based on these observations, we propose ACOPT, a continuity-aware address translation framework for MCM-GPUs. ACOPT employs a hardware-based design to adaptively capture continuous multi-granularity pages and store them into one page table entry (PTE), which releases multiple pending requests waiting for page table walkers in advance to reduce the number of page table walks. In this way, multiple pages are fetched into one L2 TLB entry to effectively extend the L2 TLB reach when L2 TLB miss happens. Our experimental results reveal that ACOPT is able to achieve 1.54<span><math><mo>×</mo></math></span> speedup, 78% reduction in number of page table walks, and 70% reduction of L1 TLB misses latency on average across a set of applications.</div></div>\",\"PeriodicalId\":55132,\"journal\":{\"name\":\"Future Generation Computer Systems-The International Journal of Escience\",\"volume\":\"175 \",\"pages\":\"Article 108048\"},\"PeriodicalIF\":6.2000,\"publicationDate\":\"2025-08-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Future Generation Computer Systems-The International Journal of Escience\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167739X25003437\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X25003437","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
ACOPT: Adaptive continuity-aware address translation for performance optimization of MCM-GPU architectures
Multi-Chiplet-Module (MCM) designs have become a rising technique for boosting GPU performance in the post-Moore era. In MCM-GPUs, multiple chiplets are integrated within a single physical package via high-bandwidth, low-latency in-package interconnects, providing more computational and storage resources than traditional monolithic GPUs. Furthermore, MCM-GPUs can adopt Unified Virtual Memory (UVM) for data management, enabling them to be programmed as a single logical GPU. In this paper, we investigate the address translation process in UVM MCM-GPUs and analyze its bottlenecks. Our profiling of a set of GPU workloads reveals that L2 TLB miss requests incur significant PCIe transfer delays, page table walker waiting time, and page table walk latency, none of which are easily hidden. In addition, address translation in UVM-enabled MCM-GPUs remains inefficient because it neglects the spatial continuity of memory access patterns and relies on a rigid translation granularity. Based on these observations, we propose ACOPT, a continuity-aware address translation framework for MCM-GPUs. ACOPT employs a hardware-based design that adaptively captures contiguous pages at multiple granularities and stores them in a single page table entry (PTE), releasing multiple requests pending on page table walkers early and reducing the number of page table walks. In this way, multiple pages are fetched into one L2 TLB entry, effectively extending the L2 TLB reach when L2 TLB misses occur. Our experimental results show that ACOPT achieves a 1.54× speedup, a 78% reduction in the number of page table walks, and a 70% reduction in L1 TLB miss latency on average across a set of applications.
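To make the coalescing idea in the abstract concrete, below is a minimal, hypothetical sketch of continuity-aware translation coalescing in software. It is not ACOPT's hardware design: all names (CoalescingTlb, CoalescedEntry, PAGE_SIZE) are illustrative assumptions, the lookup is fully associative for simplicity, and eviction, multi-granularity detection, and the UVM page walk machinery are omitted.

```cpp
// Sketch only: when a run of contiguous virtual pages maps to an equally
// contiguous run of physical frames, one entry can translate the whole run,
// which is the sense in which a single entry "extends TLB reach".
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

constexpr uint64_t PAGE_SIZE = 4096; // assume 4 KiB base pages

struct CoalescedEntry {
    uint64_t vpn_base;   // first virtual page number covered
    uint64_t pfn_base;   // first physical frame number covered
    uint64_t num_pages;  // contiguous pages covered by this single entry
};

class CoalescingTlb {
    std::vector<CoalescedEntry> entries_; // stand-in for a real TLB array
public:
    // Insert a translation; merge with an existing entry when both the
    // virtual and physical runs are contiguous (the continuity condition).
    void insert(uint64_t vpn, uint64_t pfn) {
        for (auto& e : entries_) {
            if (vpn == e.vpn_base + e.num_pages &&
                pfn == e.pfn_base + e.num_pages) {
                e.num_pages++; // extend the run: one entry, more reach
                return;
            }
        }
        entries_.push_back({vpn, pfn, 1});
    }

    // A hit anywhere inside a coalesced run resolves without a page walk.
    std::optional<uint64_t> translate(uint64_t vaddr) const {
        uint64_t vpn = vaddr / PAGE_SIZE;
        for (const auto& e : entries_) {
            if (vpn >= e.vpn_base && vpn < e.vpn_base + e.num_pages) {
                uint64_t pfn = e.pfn_base + (vpn - e.vpn_base);
                return pfn * PAGE_SIZE + vaddr % PAGE_SIZE;
            }
        }
        return std::nullopt; // miss: would trigger a page table walk
    }

    size_t entry_count() const { return entries_.size(); }
};

int main() {
    CoalescingTlb tlb;
    // Four contiguous virtual pages backed by four contiguous physical
    // frames collapse into a single entry covering all of them.
    for (uint64_t i = 0; i < 4; ++i) tlb.insert(100 + i, 500 + i);
    std::cout << "entries: " << tlb.entry_count() << "\n"; // prints 1, not 4
    if (auto pa = tlb.translate(102 * PAGE_SIZE + 42))
        std::cout << "paddr: " << *pa << "\n"; // hit inside the coalesced run
}
```

In this toy example, four page mappings collapse into one entry, so a later access to any page in the run is served without a page table walk; this mirrors the abstract's claim that coalescing contiguous pages into one PTE-sized entry reduces page table walks and extends L2 TLB reach.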
Journal introduction:
Computing infrastructures and systems are constantly evolving, giving rise to increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute such applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.