FPGA-based component-wise LSTM training accelerator for neural Granger causality analysis

IF 5.5 · CAS Zone 2, Computer Science · JCR Q1, Computer Science, Artificial Intelligence

Neurocomputing · Pub Date: 2024-11-13 · DOI: 10.1016/j.neucom.2024.128871

Chuliang Guo, Yufei Chen, Yu Fu

Neurocomputing, vol. 615, Article 128871 (2024). Full text: https://www.sciencedirect.com/science/article/pii/S0925231224016424

Citations: 0

Abstract

Component-wise LSTM (cLSTM) comprises multiple LSTM cells with distinct parameters, which makes it particularly well suited to functional Magnetic Resonance Imaging (fMRI)-based neural Granger causality (NGC) analysis of the human brain. Back-propagation-through-time training on CPUs and GPUs suffers from low utilization because of inherent data dependencies within the LSTM cell. Moreover, batch-size-1 cLSTM training and little weight reuse across input feature maps worsen this utilization problem. To this end, this study provides an FPGA-based training solution for cLSTM-based NGC analysis. The proposed cLSTM training accelerator identifies the different data dependencies in the forward and backward paths and features two key components: (1) a fine-grained pipeline within the LSTM cell that achieves the lowest initiation interval, and (2) a coarse-grained pipeline that trains input feature sequences across different LSTM cells in parallel. Experiments on the DAN sub-brain network from the COBRE dataset demonstrate the efficacy of FPGA-based cLSTM training, which achieves microsecond-level iteration latency compared with milliseconds on general-purpose platforms, e.g., 465× and 216× faster than an Intel Core 13900K CPU and an Nvidia RTX 2080Ti, respectively. To the best of our knowledge, this work is the first to demonstrate LSTM training on FPGA, significantly accelerating the analysis and modeling of complex brain networks and offering valuable advancements for neuroscience research at the edge.
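To make the structure described in the abstract concrete, the following is a minimal pure-Python sketch of a cLSTM forward pass at batch size 1. It assumes the formulation commonly used in the NGC literature, in which each cell reads all input series and predicts one target series; that layout, the gate ordering, and every name here (`init_cell`, `lstm_step`, `clstm_forward`, `w_out`) are illustrative assumptions, not the accelerator's design. It covers only the forward path, not back-propagation through time.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_cell(n_series, hidden, rng):
    """One LSTM cell with its own parameters (hypothetical layout).

    Stacked gate order (an assumption): input, forget, candidate, output.
    """
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W": mat(4 * hidden, n_series),  # input weights: the cell sees all series
        "U": mat(4 * hidden, hidden),    # recurrent weights
        "b": [0.0] * (4 * hidden),
        "w_out": [rng.uniform(-0.1, 0.1) for _ in range(hidden)],
    }


def lstm_step(cell, x_t, h, c):
    """Single time step; h and c come from the previous step."""
    n = len(h)
    z = [a + r + b for a, r, b in
         zip(matvec(cell["W"], x_t), matvec(cell["U"], h), cell["b"])]
    i = [sigmoid(v) for v in z[0:n]]
    f = [sigmoid(v) for v in z[n:2 * n]]
    g = [math.tanh(v) for v in z[2 * n:3 * n]]
    o = [sigmoid(v) for v in z[3 * n:4 * n]]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def clstm_forward(cells, series):
    """series[t] holds the n_series values at time t (batch size 1).

    Returns preds[t][k], the scalar readout of cell k after seeing steps 0..t.
    """
    hidden = len(cells[0]["w_out"])
    preds = [[0.0] * len(cells) for _ in series]
    for k, cell in enumerate(cells):      # cells are independent of one another
        h, c = [0.0] * hidden, [0.0] * hidden
        for t, x_t in enumerate(series):  # sequential: step t needs h, c from t-1
            h, c = lstm_step(cell, x_t, h, c)
            preds[t][k] = sum(w * v for w, v in zip(cell["w_out"], h))
    return preds


rng = random.Random(0)
n_series, hidden, T = 3, 4, 5
cells = [init_cell(n_series, hidden, rng) for _ in range(n_series)]
series = [[rng.gauss(0.0, 1.0) for _ in range(n_series)] for _ in range(T)]
preds = clstm_forward(cells, series)
```

The two loops mirror the two dependency patterns the abstract mentions. The inner time loop is strictly sequential because each step consumes the previous hidden and cell state, which is the kind of dependency a pipeline inside the cell has to respect. The outer loop over cells carries no dependency, which is what allows different cells to be processed in parallel. In this formulation, an all-zero column `j` in cell `k`'s input weights `W` means series `j` has no influence on the prediction for series `k`.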




Source journal

Neurocomputing (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days

Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.