Implications of memory embedding and hierarchy on the performance of MAVeC AI accelerators

Md Rownak Hossain Chowdhury, Mostafizur Rahman

Memories - Materials, Devices, Circuits and Systems, Volume 10, Article 100131. Published 2025-04-01. DOI: 10.1016/j.memori.2025.100131
Citations: 0
Abstract
Memory organization is essential for any AI (Artificial Intelligence) processor, as memory-mapped I/O dictates the system's overall throughput. No matter how fast the processing units are, or how many are integrated in parallel, performance ultimately suffers when data transfer rates fail to keep pace with compute capability. The efficacy of data orchestration within the memory hierarchy is therefore a foundational aspect of benchmarking any AI accelerator. In this work, we investigate memory organization for a messaging-based vector processing unit (MAVeC), in which data is routed across computation units to enable adaptive programmability at runtime. MAVeC features a hierarchical on-chip memory structure of less than 100 MB to minimize data movement, enhance locality, and maximize parallelism. Complementing this, we develop an end-to-end data orchestration methodology to manage data flow within the memory hierarchy. To evaluate the overall performance with memory effects included, we detail extensive benchmarking results across diverse parameters, including PCIe (Peripheral Component Interconnect Express) configurations, available hardware resources, operating frequencies, and off-chip memory bandwidth. MAVeC achieves a notable throughput of 95.39K inferences per second for AlexNet, operating at 1 GHz with 64 tiles and 32-bit precision, using PCIe 6.0 ×16 and HBM4 off-chip memory. In the TSMC 28 nm technology node, the estimated area of the MAVeC core is approximately 346 mm². These results underscore the potential of the proposed memory hierarchy for the MAVeC accelerator, positioning it as a promising solution for future AI applications.
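As a rough sanity check on the reported figures, one can estimate the host-to-accelerator input traffic that 95.39K AlexNet inferences per second would generate and compare it against the PCIe 6.0 ×16 link. This sketch is not from the paper: the AlexNet input shape (227×227×3), the use of 4-byte (32-bit) activations for transfer, and the ~128 GB/s unidirectional figure for PCIe 6.0 ×16 are all assumptions made for illustration, and the estimate ignores weights, outputs, and protocol overhead.

```python
# Back-of-envelope check: can PCIe 6.0 x16 feed the reported inference rate?
# All constants below are illustrative assumptions, not values from the paper.

BYTES_PER_ELEM = 4              # 32-bit precision, per the abstract
INPUT_SHAPE = (227, 227, 3)     # common AlexNet input size (assumption)
INFERENCES_PER_SEC = 95_390     # reported MAVeC throughput for AlexNet
PCIE6_X16_BPS = 128e9           # ~128 GB/s raw unidirectional (assumption)

# Bytes moved per inference, input activations only
input_bytes = BYTES_PER_ELEM
for dim in INPUT_SHAPE:
    input_bytes *= dim          # 227 * 227 * 3 * 4 = 618,348 bytes

# Sustained input traffic at the reported inference rate
demand_bps = input_bytes * INFERENCES_PER_SEC

print(f"input traffic: {demand_bps / 1e9:.1f} GB/s "
      f"({100 * demand_bps / PCIE6_X16_BPS:.0f}% of PCIe 6.0 x16)")
# -> input traffic: 59.0 GB/s (46% of PCIe 6.0 x16)
```

Under these assumptions, input streaming alone would consume roughly half the link, which is consistent with the abstract's emphasis on PCIe configuration and off-chip bandwidth as first-order performance parameters.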