Implications of memory embedding and hierarchy on the performance of MAVeC AI accelerators

Md Rownak Hossain Chowdhury, Mostafizur Rahman
Journal: Memories - Materials, Devices, Circuits and Systems, Volume 10, Article 100131
DOI: 10.1016/j.memori.2025.100131
Publication date: 2025-04-01
URL: https://www.sciencedirect.com/science/article/pii/S2773064625000118
Citations: 0

Abstract

Memory organization is essential for any AI (Artificial Intelligence) processor, as memory-mapped I/O dictates the system's overall throughput. Regardless of how fast the processor is or how many parallel processing units it integrates, performance will ultimately suffer when data transfer rates fail to match processing capabilities. Therefore, the efficacy of data orchestration within the memory hierarchy is a foundational aspect of benchmarking the performance of any AI accelerator. In this work, we investigate memory organization for a messaging-based vector processing unit (MAVeC), where data routes across computation units to enable adaptive programmability at runtime. MAVeC features a hierarchical on-chip memory structure of less than 100 MB to minimize data movement, enhance locality, and maximize parallelism. Complementing this, we develop an end-to-end data orchestration methodology to manage data flow within the memory hierarchy. To evaluate the overall performance incorporating memory, we detail our extensive benchmarking results across diverse parameters, including PCIe (Peripheral Component Interconnect Express) configurations, available hardware resources, operating frequencies, and off-chip memory bandwidth. MAVeC achieves a notable throughput of 95.39K inferences per second for AlexNet, operating at a 1 GHz frequency with 64 tiles and 32-bit precision, using PCIe 6.0 ×16 and HBM4 off-chip memory. In the TSMC 28 nm technology node, the estimated area for the MAVeC core is approximately 346 mm². These results underscore the potential of the proposed memory hierarchy for the MAVeC accelerator, positioning it as a promising solution for future AI applications.
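The abstract's central claim, that throughput is capped by whichever is slower, compute or data delivery, can be illustrated with a minimal roofline-style sketch. All numeric values below are hypothetical placeholders for illustration only; they are not parameters reported in the paper.

```python
# Roofline-style throughput sketch: an accelerator's inference rate is
# bounded by both its peak compute and its off-chip memory bandwidth.
# Every number here is an illustrative assumption, not a value from the paper.

def roofline_inferences_per_sec(flops_per_inference, bytes_per_inference,
                                peak_flops, mem_bandwidth_bytes):
    """Return the achievable inference rate as the minimum of the
    compute-limited and bandwidth-limited rates."""
    compute_bound = peak_flops / flops_per_inference          # inferences/s if compute-limited
    memory_bound = mem_bandwidth_bytes / bytes_per_inference  # inferences/s if bandwidth-limited
    return min(compute_bound, memory_bound)

# Hypothetical AlexNet-like workload on a hypothetical 64-tile, 1 GHz design:
tput = roofline_inferences_per_sec(
    flops_per_inference=1.4e9,   # assumed ~1.4 GFLOPs per inference
    bytes_per_inference=2.4e8,   # assumed ~240 MB moved off-chip per inference
    peak_flops=1.3e14,           # assumed peak compute of the array
    mem_bandwidth_bytes=2.0e12,  # assumed ~2 TB/s HBM-class bandwidth
)
print(f"Estimated throughput: {tput:.0f} inferences/s")
```

With these assumed numbers the design is bandwidth-limited, which is exactly why the paper's hierarchical on-chip memory (keeping working sets local and shrinking off-chip traffic per inference) raises the achievable rate.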