Implications of memory embedding and hierarchy on the performance of MAVeC AI accelerators

Md Rownak Hossain Chowdhury, Mostafizur Rahman
Journal: Memories - Materials, Devices, Circuits and Systems, Volume 10, Article 100131
DOI: 10.1016/j.memori.2025.100131
Publication date: 2025-04-01
URL: https://www.sciencedirect.com/science/article/pii/S2773064625000118
Citations: 0

Abstract

Memory organization is essential for any AI (Artificial Intelligence) processor, as memory-mapped I/O dictates the system's overall throughput. Regardless of how fast the processor is or how many parallel processing units it integrates, performance will ultimately suffer when data transfer rates fail to match processing capabilities. Therefore, the efficacy of data orchestration within the memory hierarchy is a foundational aspect of benchmarking the performance of any AI accelerator. In this work, we investigate memory organization for a messaging-based vector processing unit (MAVeC), where data routes across computation units to enable adaptive programmability at runtime. MAVeC features a hierarchical on-chip memory structure of less than 100 MB to minimize data movement, enhance locality, and maximize parallelism. Complementing this, we develop an end-to-end data orchestration methodology to manage data flow within the memory hierarchy. To evaluate the overall performance incorporating memory, we detail our extensive benchmarking results across diverse parameters, including PCIe (Peripheral Component Interconnect Express) configurations, available hardware resources, operating frequencies, and off-chip memory bandwidth. MAVeC achieves a notable throughput of 95.39K inferences per second for AlexNet, operating at a 1 GHz frequency with 64 tiles and 32-bit precision, using PCIe 6.0 ×16 and HBM4 off-chip memory. In the TSMC 28 nm technology node, the estimated area for the MAVeC core is approximately 346 mm². These results underscore the potential of the proposed memory hierarchy for the MAVeC accelerator, positioning it as a promising solution for future AI applications.
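The abstract's central claim, that throughput is capped by whichever is slower, compute or data delivery, can be illustrated with a minimal roofline-style sketch. All numeric values below are hypothetical placeholders for illustration only; they are not parameters reported in the paper.

```python
# Roofline-style throughput sketch: an accelerator's inference rate is
# bounded by both its peak compute and its off-chip memory bandwidth.
# Every number here is an illustrative assumption, not a value from the paper.

def roofline_inferences_per_sec(flops_per_inference, bytes_per_inference,
                                peak_flops, mem_bandwidth_bytes):
    """Return the achievable inference rate as the minimum of the
    compute-limited and bandwidth-limited rates."""
    compute_bound = peak_flops / flops_per_inference          # inferences/s if compute-limited
    memory_bound = mem_bandwidth_bytes / bytes_per_inference  # inferences/s if bandwidth-limited
    return min(compute_bound, memory_bound)

# Hypothetical AlexNet-like workload on a hypothetical 64-tile, 1 GHz design:
tput = roofline_inferences_per_sec(
    flops_per_inference=1.4e9,   # assumed ~1.4 GFLOPs per inference
    bytes_per_inference=2.4e8,   # assumed ~240 MB moved off-chip per inference
    peak_flops=1.3e14,           # assumed peak compute of the array
    mem_bandwidth_bytes=2.0e12,  # assumed ~2 TB/s HBM-class bandwidth
)
print(f"Estimated throughput: {tput:.0f} inferences/s")
```

With these assumed numbers the design is bandwidth-limited, which is exactly why the paper's hierarchical on-chip memory (keeping working sets local and shrinking off-chip traffic per inference) raises the achievable rate.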