Scalable Multi-FPGA HPC Architecture for Associative Memory System

IEEE Transactions on Biomedical Circuits and Systems. Pub Date: 2024-08-20. DOI: 10.1109/TBCAS.2024.3446660

Deyu Wang;Xiaoze Yan;Yu Yang;Dimitrios Stathis;Ahmed Hemani;Anders Lansner;Jiawei Xu;Li-Rong Zheng;Zhuo Zou

Vol. 19, no. 2, pp. 454-468. URL: https://ieeexplore.ieee.org/document/10643037/


Abstract

Associative memory is a cornerstone of cognitive intelligence within the human brain. The Bayesian confidence propagation neural network (BCPNN), a cortex-inspired model with high biological plausibility, has proven effective in emulating high-level cognitive functions such as associative memory. However, the current approach of using GPUs to simulate BCPNN-based associative memory tasks encounters challenges in latency and power efficiency as the model size scales. This work proposes a scalable multi-FPGA high-performance computing (HPC) architecture designed for the associative memory system. The architecture integrates a set of hypercolumn unit (HCU) computing cores for intra-board online learning and inference, along with a spike-based synchronization scheme for inter-board communication among multiple FPGAs. Several design strategies, including population-based model mapping, packet-based spike synchronization, and cluster-based timing optimization, are presented to facilitate the multi-FPGA implementation. The architecture is implemented and validated on two Xilinx Alveo U50 FPGA cards, achieving a maximum model size of 200 × 10 and a peak working frequency of 220 MHz for the associative memory system. Both the memory-bounded spatial scalability and the compute-bounded temporal scalability of the architecture are evaluated and optimized, achieving a maximum scale-latency ratio (SLR) of 268.82 for the two-FPGA implementation. Compared to a two-GPU counterpart, the two-FPGA approach demonstrates a maximum latency reduction of 51.72× and a power reduction exceeding 5.28× under the same network configuration. Compared with state-of-the-art works, the two-FPGA implementation exhibits a high pattern storage capacity for the associative memory task.
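
The abstract names population-based model mapping and packet-based spike synchronization without giving their details. The sketch below is only an illustration of the general idea, not the authors' design: it assumes that the 200 in the 200 × 10 model size counts hypercolumns, that hypercolumns are split into contiguous blocks across boards, and that the spikes of one timestep are grouped into fixed-capacity packets. The names `Spike`, `map_hcus_to_boards`, `owner_of`, and `pack_spikes`, and the packet capacity, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Spike:
    hcu: int  # global index of the source hypercolumn (assumed addressing)
    mcu: int  # index of the spiking minicolumn within that hypercolumn


def map_hcus_to_boards(num_hcus: int, num_boards: int) -> list[range]:
    """Split global HCU indices into contiguous, near-equal blocks, one per board."""
    base, extra = divmod(num_hcus, num_boards)
    blocks, start = [], 0
    for board in range(num_boards):
        size = base + (1 if board < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def owner_of(hcu: int, blocks: list[range]) -> int:
    """Return the index of the board that hosts the given HCU."""
    for board, block in enumerate(blocks):
        if hcu in block:
            return board
    raise ValueError(f"HCU {hcu} is not mapped to any board")


def pack_spikes(spikes: list[Spike], max_per_packet: int) -> list[list[Spike]]:
    """Group one timestep's outgoing spikes into packets of bounded size."""
    return [
        spikes[i:i + max_per_packet]
        for i in range(0, len(spikes), max_per_packet)
    ]


# Hypothetical usage: 200 hypercolumns on two boards, packets of up to 2 spikes.
blocks = map_hcus_to_boards(200, 2)
outgoing = [Spike(hcu=h, mcu=h % 10) for h in (3, 17, 42, 58, 99)]
packets = pack_spikes(outgoing, max_per_packet=2)
```

Under these assumptions each board hosts 100 hypercolumns, and a board only needs to send the other board the packed spikes of its own block once per timestep.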


