FPGA-based component-wise LSTM training accelerator for neural Granger causality analysis

IF 5.5 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Chuliang Guo, Yufei Chen, Yu Fu
Neurocomputing, vol. 615, Article 128871. Published 2024-11-13. DOI: 10.1016/j.neucom.2024.128871
https://www.sciencedirect.com/science/article/pii/S0925231224016424
Citations: 0

Abstract

Component-wise LSTM (cLSTM) comprises multiple LSTM cells with distinct parameters, which makes it particularly well suited to functional Magnetic Resonance Imaging (fMRI)-based neural Granger causality (NGC) analysis of the human brain. Back-propagation-through-time (BPTT) training on CPUs and GPUs suffers from low utilization because of the inherent data dependencies within the LSTM cell. Moreover, cLSTM training with a batch size of 1 and limited weight reuse across input feature maps worsen this utilization problem. To this end, this study provides an FPGA-based training solution for cLSTM-based NGC analysis. The proposed cLSTM training accelerator identifies the different data dependencies in the forward and backward paths, and features two key components: (1) a fine-grained pipeline within the LSTM cell that achieves the lowest initiation interval, and (2) a coarse-grained pipeline that trains input feature sequences across different LSTM cells in parallel. Experiments on the DAN sub-brain network from the COBRE dataset demonstrate the efficacy of FPGA-based cLSTM training, which achieves microsecond-level iteration latency compared with millisecond-level latency on general-purpose platforms, e.g., 465× and 216× faster than an Intel Core 13900K CPU and an Nvidia RTX 2080Ti GPU, respectively. To the best of our knowledge, this work is the first to demonstrate LSTM training on an FPGA, significantly accelerating the analysis and modeling of complex brain networks and offering valuable advancements for neuroscience research at the edge.
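The structure the abstract describes can be illustrated with a minimal pure-Python sketch of a cLSTM forward pass at batch size 1. It is not the accelerator's implementation and covers only the forward path, not BPTT. The gate ordering, the scalar linear readout, the parameter names (`W`, `U`, `b`, `w_out`), and the toy sizes are all assumptions made for illustration. The sketch shows the two properties the proposed pipelines rely on: each step of a cell needs the hidden and cell state of the previous step, while the cells themselves share inputs but no parameters.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_cell(n_in, n_hid, rng):
    """Hypothetical parameter layout: gates i, f, g, o stacked row-wise."""
    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "W": rand_matrix(4 * n_hid, n_in),   # input-to-hidden weights
        "U": rand_matrix(4 * n_hid, n_hid),  # hidden-to-hidden weights
        "b": [0.0] * (4 * n_hid),            # gate biases
        "w_out": [rng.uniform(-0.1, 0.1) for _ in range(n_hid)],  # assumed linear readout
    }


def cell_forward(params, sequence, n_hid):
    """Run one LSTM cell over a sequence (batch size 1); one scalar prediction per step."""
    h = [0.0] * n_hid
    c = [0.0] * n_hid
    preds = []
    for x_t in sequence:  # sequential: step t needs h and c from step t-1
        wx = matvec(params["W"], x_t)
        uh = matvec(params["U"], h)
        z = [a + b + d for a, b, d in zip(wx, uh, params["b"])]
        i = [sigmoid(v) for v in z[:n_hid]]
        f = [sigmoid(v) for v in z[n_hid:2 * n_hid]]
        g = [math.tanh(v) for v in z[2 * n_hid:3 * n_hid]]
        o = [sigmoid(v) for v in z[3 * n_hid:]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        preds.append(sum(w * hk for w, hk in zip(params["w_out"], h)))
    return preds


def clstm_forward(cells, sequence, n_hid):
    """One independent cell per output component; cells share inputs, not parameters."""
    return [cell_forward(params, sequence, n_hid) for params in cells]


if __name__ == "__main__":
    rng = random.Random(0)
    n_series, n_hid, n_steps = 3, 4, 5  # toy sizes, not taken from the paper
    cells = [init_cell(n_series, n_hid, rng) for _ in range(n_series)]
    sequence = [[rng.uniform(-1, 1) for _ in range(n_series)] for _ in range(n_steps)]
    preds = clstm_forward(cells, sequence, n_hid)
    print(len(preds), len(preds[0]))
```

The loop in `cell_forward` is the part a fine-grained pipeline would target, since consecutive steps cannot simply run side by side. The list comprehension in `clstm_forward` has no cross-cell dependency, which is what leaves room for the cells to be processed in parallel.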
Source journal: Neurocomputing (Engineering and Technology, Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.