短时长内存网络的低面积、低功耗 VLSI 架构

IF 3.7 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems Pub Date : 2023-11-06 DOI:10.1109/JETCAS.2023.3330428

Mohammed A. Alhartomi;Mohd Tasleem Khan;Saeed Alzahrani;Ahmed Alzahmi;Rafi Ahamed Shaik;Jinti Hazarika;Ruwaybih Alsulami;Abdulaziz Alotaibi;Meshal Al-Harthi

{"title":"短时长内存网络的低面积、低功耗 VLSI 架构","authors":"Mohammed A. Alhartomi;Mohd Tasleem Khan;Saeed Alzahrani;Ahmed Alzahmi;Rafi Ahamed Shaik;Jinti Hazarika;Ruwaybih Alsulami;Abdulaziz Alotaibi;Meshal Al-Harthi","doi":"10.1109/JETCAS.2023.3330428","DOIUrl":null,"url":null,"abstract":"Long short-term memory (LSTM) networks are extensively used in various sequential learning tasks, including speech recognition. Their significance in real-world applications has prompted the demand for cost-effective and power-efficient designs. This paper introduces LSTM architectures based on distributed arithmetic (DA), utilizing circulant and block-circulant matrix-vector multiplications (MVMs) for network compression. The quantized weights-oriented approach for training circulant and block-circulant matrices is considered. By formulating fixed-point circulant/block-circulant MVMs, we explore the impact of kernel size on accuracy. Our DA-based approach employs shared full and partial methods of add-store/store-add followed by a select unit to realize an MVM. It is then coupled with a multi-partial strategy to reduce complexity for larger kernel sizes. Further complexity reduction is achieved by optimizing decoders of multiple select units. Pipelining in add-store enhances speed at the expense of a few pipelined registers. The results of the field-programmable gate array showcase the superiority of our proposed architectures based on the partial store-add method, delivering reductions of 98.71% in DSP slices, 33.59% in slice look-up tables, 13.43% in flip-flops, and 29.76% in power compared to the state-of-the-art.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"13 4","pages":"1000-1014"},"PeriodicalIF":3.7000,"publicationDate":"2023-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Low-Area and Low-Power VLSI Architectures for Long Short-Term Memory Networks\",\"authors\":\"Mohammed A. Alhartomi;Mohd Tasleem Khan;Saeed Alzahrani;Ahmed Alzahmi;Rafi Ahamed Shaik;Jinti Hazarika;Ruwaybih Alsulami;Abdulaziz Alotaibi;Meshal Al-Harthi\",\"doi\":\"10.1109/JETCAS.2023.3330428\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Long short-term memory (LSTM) networks are extensively used in various sequential learning tasks, including speech recognition. Their significance in real-world applications has prompted the demand for cost-effective and power-efficient designs. This paper introduces LSTM architectures based on distributed arithmetic (DA), utilizing circulant and block-circulant matrix-vector multiplications (MVMs) for network compression. The quantized weights-oriented approach for training circulant and block-circulant matrices is considered. By formulating fixed-point circulant/block-circulant MVMs, we explore the impact of kernel size on accuracy. Our DA-based approach employs shared full and partial methods of add-store/store-add followed by a select unit to realize an MVM. It is then coupled with a multi-partial strategy to reduce complexity for larger kernel sizes. Further complexity reduction is achieved by optimizing decoders of multiple select units. Pipelining in add-store enhances speed at the expense of a few pipelined registers. The results of the field-programmable gate array showcase the superiority of our proposed architectures based on the partial store-add method, delivering reductions of 98.71% in DSP slices, 33.59% in slice look-up tables, 13.43% in flip-flops, and 29.76% in power compared to the state-of-the-art.\",\"PeriodicalId\":48827,\"journal\":{\"name\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"volume\":\"13 4\",\"pages\":\"1000-1014\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2023-11-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10309947/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10309947/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

摘要

长短期记忆（LSTM）网络被广泛应用于各种顺序学习任务，包括语音识别。它们在实际应用中的重要性促使人们对高性价比、高能效的设计提出了更高的要求。本文介绍了基于分布式运算（DA）的 LSTM 架构，利用环形和块环形矩阵向量乘法（MVM）进行网络压缩。研究考虑了以量化权重为导向的环形矩阵和块环形矩阵训练方法。通过制定定点环形/块环形 MVM，我们探索了内核大小对准确性的影响。我们基于数模转换的方法采用了共享的加-存/存-加全方法和部分方法，然后通过一个选择单元来实现 MVM。然后，它与多部分策略相结合，降低了更大内核尺寸的复杂性。通过优化多选择单元的解码器，进一步降低了复杂性。加法存储中的流水线设计以牺牲几个流水线寄存器为代价提高了速度。现场可编程门阵列的结果表明，我们提出的基于部分存储-添加方法的体系结构具有优越性，与最先进的体系结构相比，DSP 片数减少了 98.71%，片数查找表减少了 33.59%，触发器减少了 13.43%，功耗减少了 29.76%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Low-Area and Low-Power VLSI Architectures for Long Short-Term Memory Networks

Long short-term memory (LSTM) networks are extensively used in various sequential learning tasks, including speech recognition. Their significance in real-world applications has prompted the demand for cost-effective and power-efficient designs. This paper introduces LSTM architectures based on distributed arithmetic (DA), utilizing circulant and block-circulant matrix-vector multiplications (MVMs) for network compression. The quantized weights-oriented approach for training circulant and block-circulant matrices is considered. By formulating fixed-point circulant/block-circulant MVMs, we explore the impact of kernel size on accuracy. Our DA-based approach employs shared full and partial methods of add-store/store-add followed by a select unit to realize an MVM. It is then coupled with a multi-partial strategy to reduce complexity for larger kernel sizes. Further complexity reduction is achieved by optimizing decoders of multiple select units. Pipelining in add-store enhances speed at the expense of a few pipelined registers. The results of the field-programmable gate array showcase the superiority of our proposed architectures based on the partial store-add method, delivering reductions of 98.71% in DSP slices, 33.59% in slice look-up tables, 13.43% in flip-flops, and 29.76% in power compared to the state-of-the-art.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Journal on Emerging and Selected Topics in Circuits and Systems ENGINEERING, ELECTRICAL & ELECTRONIC-

CiteScore

8.50

自引率

2.20%

发文量

期刊介绍： The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.