ACOPT: Adaptive continuity-aware address translation for performance optimization of MCM-GPU architectures

Impact Factor 6.2 · CAS Zone 2 (Computer Science) · JCR Q1, Computer Science, Theory & Methods
Jingweijia Tan, Zhanyuntian Li, Weiren Wang, Jiashuo Wang, Kaige Yan, Xiaohui Wei
Journal: Future Generation Computer Systems (The International Journal of eScience), Volume 175, Article 108048
DOI: 10.1016/j.future.2025.108048
Published: 2025-08-04 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0167739X25003437
Citations: 0

Abstract

Multi-Chiplet-Module (MCM) designs have become a rising technique for boosting GPU performance in the post-Moore era. In MCM-GPUs, multiple chiplets are integrated within a single physical package via high-bandwidth, low-latency in-package interconnects, providing more computational and storage resources than traditional monolithic GPUs. Furthermore, MCM-GPUs can adopt Unified Virtual Memory (UVM) for data management, enabling them to be programmed as logically single GPUs. In this paper, we investigate the address translation process in UVM MCM-GPUs and analyze its bottlenecks. Our profiling of a set of GPU workloads reveals that L2 TLB miss requests incur significant PCIe transfer delays, page table walker waiting time, and page table walk latency, none of which are easily hidden. In addition, address translation in UVM-enabled MCM-GPUs remains inefficient because it neglects the spatial continuity of memory access patterns and uses a rigid translation granularity. Based on these observations, we propose ACOPT, a continuity-aware address translation framework for MCM-GPUs. ACOPT employs a hardware-based design that adaptively captures runs of contiguous multi-granularity pages and stores them in a single page table entry (PTE), releasing multiple pending requests waiting for page table walkers early and thereby reducing the number of page table walks. In this way, multiple pages are fetched into one L2 TLB entry, effectively extending L2 TLB reach when an L2 TLB miss occurs. Our experimental results show that ACOPT achieves a 1.54× speedup, a 78% reduction in the number of page table walks, and a 70% reduction in L1 TLB miss latency, on average, across a set of applications.
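The core idea in the abstract — detecting runs of contiguous virtual-to-physical page mappings and caching each run as a single multi-page TLB entry — can be illustrated with a toy software model. This is a minimal sketch under our own assumptions, not the paper's actual hardware design; the class and method names are hypothetical, and real coalescing would be done by dedicated logic alongside the page table walkers.

```python
# Toy model of continuity-aware TLB coalescing: after a page table walk,
# scan forward through neighboring PTEs and, while the virtual->physical
# mapping stays contiguous, grow the run; cache the whole run as one
# multi-granularity TLB entry, extending reach without adding entries.

PAGE_SHIFT = 12  # 4 KiB base pages

class CoalescedTLB:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = []  # (base_vpn, base_pfn, run_length), MRU first

    def lookup(self, vaddr):
        """Return the physical address on a hit, or None on a miss."""
        vpn = vaddr >> PAGE_SHIFT
        for i, (base_vpn, base_pfn, run) in enumerate(self.entries):
            if base_vpn <= vpn < base_vpn + run:
                # Hit anywhere inside the coalesced run: translate by offset.
                self.entries.insert(0, self.entries.pop(i))  # refresh LRU
                pfn = base_pfn + (vpn - base_vpn)
                return (pfn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
        return None  # miss -> would trigger a page table walk

    def fill(self, vpn, page_table, max_run=16):
        """Install the walk result for vpn, coalescing contiguous neighbors.

        Returns the run length stored, so pending requests for any page in
        the run can be released early (the effect ACOPT exploits).
        """
        base_vpn, base_pfn = vpn, page_table[vpn]
        run = 1
        # The "continuity" check: next virtual page maps to next physical frame.
        while (run < max_run and (base_vpn + run) in page_table
               and page_table[base_vpn + run] == base_pfn + run):
            run += 1
        self.entries.insert(0, (base_vpn, base_pfn, run))
        if len(self.entries) > self.capacity:
            self.entries.pop()  # evict LRU
        return run

# Eight contiguous pages collapse into a single entry: one walk for page 0
# then serves translations for pages 0-7 without further walks.
page_table = {vpn: 100 + vpn for vpn in range(8)}
tlb = CoalescedTLB(capacity=4)
run = tlb.fill(0, page_table)      # run == 8
paddr = tlb.lookup(0x3025)         # vpn 3 hits inside the coalesced run
```

In this model a single entry covers `run` base pages, so a fixed-capacity TLB reaches `run` times more memory for perfectly contiguous mappings; with fragmented mappings the continuity check stops early and the entry degrades gracefully to a single page.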
Source journal metrics: CiteScore 19.90 · Self-citation rate 2.70% · Articles per year 376 · Review time 10.6 months
Journal scope: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.