Scalable Multi-FPGA HPC Architecture for Associative Memory System.

Deyu Wang, Xiaoze Yan, Yu Yang, Dimitrios Stathis, Ahmed Hemani, Anders Lansner, Jiawei Xu, Li-Rong Zheng, Zhuo Zou
IEEE Transactions on Biomedical Circuits and Systems. Published 2024-08-20. DOI: 10.1109/TBCAS.2024.3446660 (https://doi.org/10.1109/TBCAS.2024.3446660)

Abstract

Associative memory is a cornerstone of cognitive intelligence in the human brain. The Bayesian confidence propagation neural network (BCPNN), a cortex-inspired model with high biological plausibility, has proven effective at emulating high-level cognitive functions such as associative memory. However, the current approach of simulating BCPNN-based associative memory tasks on GPUs runs into latency and power-efficiency problems as the model size grows. This work proposes a scalable multi-FPGA high-performance computing (HPC) architecture for the associative memory system. The architecture integrates a set of hypercolumn unit (HCU) computing cores for intra-board online learning and inference with a spike-based synchronization scheme for inter-board communication among multiple FPGAs. Several design strategies, including population-based model mapping, packet-based spike synchronization, and cluster-based timing optimization, are presented to facilitate the multi-FPGA implementation. The architecture is implemented and validated on two Xilinx Alveo U50 FPGA cards, achieving a maximum model size of 200×10 and a peak working frequency of 220 MHz for the associative memory system. Both the memory-bounded spatial scalability and the compute-bounded temporal scalability of the architecture are evaluated and optimized, achieving a maximum scale-latency ratio (SLR) of 268.82 for the two-FPGA implementation. Compared with a two-GPU counterpart under the same network configuration, the two-FPGA approach achieves a maximum latency reduction of 51.72× and a power reduction exceeding 5.28×. Compared with state-of-the-art works, the two-FPGA implementation exhibits a high pattern storage capacity for the associative memory task.
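
The interplay of population-based mapping and packet-based spike synchronization described above can be illustrated with a small software sketch. This is not the authors' hardware design: the contiguous block partition, the softmax winner sampling inside each hypercolumn, the `(hcu_id, mcu_id)` packet layout, and all function names are assumptions made for illustration. The sketch also reads the reported 200×10 model size as 200 hypercolumns of 10 minicolumns each, which the abstract does not state explicitly.

```python
import math
import random


def partition_hcus(num_hcus, num_boards):
    """Assign contiguous blocks of hypercolumn indices to boards (assumed mapping)."""
    base, extra = divmod(num_hcus, num_boards)
    blocks, start = [], 0
    for board in range(num_boards):
        size = base + (1 if board < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def softmax(support):
    """Numerically stable softmax over the minicolumn supports of one hypercolumn."""
    peak = max(support)
    exps = [math.exp(s - peak) for s in support]
    total = sum(exps)
    return [e / total for e in exps]


def hcu_step(support, rng):
    """One hypercolumn update: normalize supports, then sample a single winner."""
    probs = softmax(support)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def board_step(hcu_ids, supports, rng):
    """Run every HCU local to a board; emit one (hcu_id, mcu_id) packet per HCU."""
    return [(h, hcu_step(supports[h], rng)) for h in hcu_ids]


def synchronize(packets_per_board):
    """Merge the packets of all boards into one global spike map {hcu_id: mcu_id}."""
    merged = {}
    for packets in packets_per_board:
        merged.update(packets)
    return merged


# Usage: 200 hypercolumns x 10 minicolumns split over two boards (assumed reading).
rng = random.Random(0)
num_hcus, num_mcus, num_boards = 200, 10, 2
supports = [[rng.gauss(0.0, 1.0) for _ in range(num_mcus)] for _ in range(num_hcus)]
blocks = partition_hcus(num_hcus, num_boards)
spikes = synchronize([board_step(block, supports, rng) for block in blocks])
```

In this sketch each board only needs the winner index of every remote hypercolumn, so the data exchanged per step grows with the number of hypercolumns rather than with the number of synapses. That is the usual motivation for exchanging spikes instead of state, although the actual packet format and link protocol of the paper are not given in the abstract.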
