CLMalloc: contiguous memory management mechanism for large-scale CPU-accelerator hybrid architectures

Yushuqing Zhang, Kai Lu, Wen-zhe Zhang
DOI: 10.1117/12.2660807
Published in: International Symposium on Computer Engineering and Intelligent Communications, 2023-02-02
Citations: 0

Abstract

Heterogeneous accelerators play a crucial role in improving computer performance. General-purpose computers reduce the frequent communication between traditional accelerators with separate memory and the host through fast communication links. Some high-speed systems, such as supercomputers, integrate the accelerator and CPU on one chip, with the shared memory managed by the operating system; this shifts the performance bottleneck from data acquisition to accelerator addressing. Existing memory management mechanisms typically reserve contiguous physical memory locally for peripherals to enable efficient direct memory access (DMA). However, in large computer systems with multiple memory nodes, the accelerator's memory access behavior is limited by the local memory capacity, and the difficulty of addressing accelerators across nodes prevents computers from fully exploiting massive memory. This paper proposes CLMalloc, a contiguous memory management mechanism for large-scale CPU-accelerator hybrid architectures, which simultaneously supports the different memory requirements of CPU and accelerator programs. In simulation experiments, CLMalloc achieves performance similar to (or even better than) the system functions malloc/free. Compared with a DMA-based baseline, CLMalloc doubles space utilization and reduces latency by 80% to 90%.
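The paper's full design is not reproduced in this abstract, but the core idea it builds on, reserving a contiguous region and handing out sub-allocations from it so a device can address the whole range, can be illustrated with a minimal first-fit allocator over a statically reserved arena. This is only a sketch of the general technique: the arena, block layout, and function names (`cl_alloc`, `cl_free`) are hypothetical and are not CLMalloc's actual interface.

```c
/* Minimal sketch: first-fit allocation from one contiguous arena.
 * Illustrates contiguous-region sub-allocation in general; it is NOT
 * the paper's CLMalloc, and all names here are hypothetical. */
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (1 << 16)   /* 64 KiB contiguous arena */
#define ALIGN 16

typedef struct block {
    size_t size;                /* payload size in bytes */
    int    free;                /* 1 if this block is available */
    struct block *next;         /* next block in address order */
} block_t;

static _Alignas(ALIGN) uint8_t arena[ARENA_SIZE];
static block_t *head = NULL;

static size_t align_up(size_t n) {
    return (n + ALIGN - 1) & ~(size_t)(ALIGN - 1);
}

/* First-fit allocation from the contiguous arena. */
void *cl_alloc(size_t size) {
    size = align_up(size);
    if (!head) {                /* lazily set up one big free block */
        head = (block_t *)arena;
        head->size = ARENA_SIZE - sizeof(block_t);
        head->free = 1;
        head->next = NULL;
    }
    for (block_t *b = head; b; b = b->next) {
        if (b->free && b->size >= size) {
            /* split if the remainder can hold a header + payload */
            if (b->size >= size + sizeof(block_t) + ALIGN) {
                block_t *rest = (block_t *)((uint8_t *)(b + 1) + size);
                rest->size = b->size - size - sizeof(block_t);
                rest->free = 1;
                rest->next = b->next;
                b->size = size;
                b->next = rest;
            }
            b->free = 0;
            return b + 1;       /* payload follows the header */
        }
    }
    return NULL;                /* arena exhausted */
}

/* Mark a block free and coalesce with the following block if free. */
void cl_free(void *p) {
    if (!p) return;
    block_t *b = (block_t *)p - 1;
    b->free = 1;
    if (b->next && b->next->free) {
        b->size += sizeof(block_t) + b->next->size;
        b->next = b->next->next;
    }
}
```

Because every allocation lives inside one physically contiguous range, a DMA engine configured with the arena's base and length can reach all of them; the cross-node addressing problem the abstract describes arises when such an arena cannot grow beyond one node's local memory.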