Low-Area and Low-Power VLSI Architectures for Long Short-Term Memory Networks

IF 3.7 | Zone 2 (Engineering and Technology) | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC
Mohammed A. Alhartomi;Mohd Tasleem Khan;Saeed Alzahrani;Ahmed Alzahmi;Rafi Ahamed Shaik;Jinti Hazarika;Ruwaybih Alsulami;Abdulaziz Alotaibi;Meshal Al-Harthi
Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 13, no. 4, pp. 1000-1014
Published: 2023-11-06
Publication type: Journal Article
DOI: 10.1109/JETCAS.2023.3330428
URL: https://ieeexplore.ieee.org/document/10309947/
Open access: No
Citations: 0

Abstract

Long short-term memory (LSTM) networks are extensively used in sequential learning tasks, including speech recognition. Their significance in real-world applications has prompted demand for cost-effective and power-efficient designs. This paper introduces LSTM architectures based on distributed arithmetic (DA), utilizing circulant and block-circulant matrix-vector multiplications (MVMs) for network compression. A quantized-weight-oriented approach to training circulant and block-circulant matrices is considered. By formulating fixed-point circulant/block-circulant MVMs, we explore the impact of kernel size on accuracy. Our DA-based approach employs shared full and partial methods of add-store/store-add followed by a select unit to realize an MVM. It is then coupled with a multi-partial strategy to reduce complexity for larger kernel sizes. Further complexity reduction is achieved by optimizing the decoders of multiple select units. Pipelining in add-store enhances speed at the expense of a few pipeline registers. Field-programmable gate array (FPGA) results show that the proposed architectures based on the partial store-add method reduce DSP slices by 98.71%, slice look-up tables by 33.59%, flip-flops by 13.43%, and power by 29.76% compared with the state of the art.
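
To make the two building blocks named in the abstract concrete, here is a minimal Python sketch of a fixed-point circulant MVM computed with textbook distributed arithmetic. It is not the paper's add-store/store-add or select-unit architecture, which the abstract does not specify in detail. The function names, the circulant indexing convention `C[i][j] = first_col[(i - j) % n]`, and the choice of two's-complement inputs of width `bits` are all assumptions made for illustration.

```python
def circulant_matvec(first_col, x):
    """Reference result y = C x, where C[i][j] = first_col[(i - j) % n].

    Only the n entries of first_col are stored instead of n*n weights,
    which is the source of the compression.
    """
    n = len(first_col)
    return [sum(first_col[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def build_lut(weights):
    """Precompute the sum of weights selected by every n-bit address."""
    n = len(weights)
    return [sum(w for k, w in enumerate(weights) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(lut, xs, bits):
    """Multiplier-free inner product via distributed arithmetic.

    Assumes each x fits in `bits`-bit two's complement. One bit-plane of
    all inputs forms a LUT address; the looked-up partial sum is shifted
    and accumulated, with the sign-bit plane subtracted.
    """
    acc = 0
    for b in range(bits):
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k  # bit b of input k
        term = lut[addr] * (1 << b)
        acc += -term if b == bits - 1 else term
    return acc


def da_circulant_matvec(first_col, x, bits):
    """Circulant MVM where every row is evaluated with the DA inner product."""
    n = len(first_col)
    out = []
    for i in range(n):
        row = [first_col[(i - j) % n] for j in range(n)]  # rotated copy of the kernel
        out.append(da_inner_product(build_lut(row), x, bits))
    return out


if __name__ == "__main__":
    c = [1, -2, 3, 4]    # hypothetical quantized kernel
    x = [3, -1, 2, -4]   # hypothetical 4-bit signed inputs
    assert da_circulant_matvec(c, x, bits=4) == circulant_matvec(c, x)
```

The sketch rebuilds a look-up table for each row for clarity. Because every row of a circulant matrix is a rotation of the same kernel, a hardware design can share stored partial sums across rows, which is presumably the kind of sharing the abstract's "shared full and partial methods" refers to.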
Source journal
CiteScore: 8.50
Self-citation rate: 2.20%
Articles published: 86
Journal description: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.