CLMalloc: contiguous memory management mechanism for large-scale CPU-accelerator hybrid architectures

Yushuqing Zhang, Kai Lu, Wen-zhe Zhang
DOI: 10.1117/12.2660807
Published in: International Symposium on Computer Engineering and Intelligent Communications, 2023-02-02
Citations: 0

Abstract

Heterogeneous accelerators play a crucial role in improving computer performance. General-purpose computers reduce the frequent communication between traditional accelerators with separate memory and the host through fast communication links. Some high-speed systems, such as supercomputers, integrate the accelerator and CPU on one chip, with the shared memory managed by the operating system; this shifts the performance bottleneck from data acquisition to accelerator addressing. Existing memory management mechanisms typically reserve contiguous physical memory locally for peripherals to enable efficient direct memory access (DMA). However, in large computer systems with multiple memory nodes, the accelerator's memory access behavior is limited by the local memory capacity, and the difficulty of addressing accelerators across nodes prevents computers from fully exploiting massive memory. This paper proposes CLMalloc, a contiguous memory management mechanism for large-scale CPU-accelerator hybrid architectures, which simultaneously supports the different memory requirements of CPU and accelerator programs. In simulation experiments, CLMalloc achieves performance similar to (or even better than) the system functions malloc/free. Compared with a DMA-based baseline, CLMalloc doubles space utilization and reduces latency by 80% to 90%.
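The paper's full design is not reproduced in this abstract, but the core idea it builds on, reserving a contiguous region and handing out sub-allocations from it so a device can address the whole range, can be illustrated with a minimal first-fit allocator over a statically reserved arena. This is only a sketch of the general technique: the arena, block layout, and function names (`cl_alloc`, `cl_free`) are hypothetical and are not CLMalloc's actual interface.

```c
/* Minimal sketch: first-fit allocation from one contiguous arena.
 * Illustrates contiguous-region sub-allocation in general; it is NOT
 * the paper's CLMalloc, and all names here are hypothetical. */
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (1 << 16)   /* 64 KiB contiguous arena */
#define ALIGN 16

typedef struct block {
    size_t size;                /* payload size in bytes */
    int    free;                /* 1 if this block is available */
    struct block *next;         /* next block in address order */
} block_t;

static _Alignas(ALIGN) uint8_t arena[ARENA_SIZE];
static block_t *head = NULL;

static size_t align_up(size_t n) {
    return (n + ALIGN - 1) & ~(size_t)(ALIGN - 1);
}

/* First-fit allocation from the contiguous arena. */
void *cl_alloc(size_t size) {
    size = align_up(size);
    if (!head) {                /* lazily set up one big free block */
        head = (block_t *)arena;
        head->size = ARENA_SIZE - sizeof(block_t);
        head->free = 1;
        head->next = NULL;
    }
    for (block_t *b = head; b; b = b->next) {
        if (b->free && b->size >= size) {
            /* split if the remainder can hold a header + payload */
            if (b->size >= size + sizeof(block_t) + ALIGN) {
                block_t *rest = (block_t *)((uint8_t *)(b + 1) + size);
                rest->size = b->size - size - sizeof(block_t);
                rest->free = 1;
                rest->next = b->next;
                b->size = size;
                b->next = rest;
            }
            b->free = 0;
            return b + 1;       /* payload follows the header */
        }
    }
    return NULL;                /* arena exhausted */
}

/* Mark a block free and coalesce with the following block if free. */
void cl_free(void *p) {
    if (!p) return;
    block_t *b = (block_t *)p - 1;
    b->free = 1;
    if (b->next && b->next->free) {
        b->size += sizeof(block_t) + b->next->size;
        b->next = b->next->next;
    }
}
```

Because every allocation lives inside one physically contiguous range, a DMA engine configured with the arena's base and length can reach all of them; the cross-node addressing problem the abstract describes arises when such an arena cannot grow beyond one node's local memory.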