A Flexible DA-Based Architecture for Computation of Inner Product of Variable Vectors

IF 2.8 · Zone 2 (Engineering & Technology) · JCR Q2 · COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Pub Date: 2025-02-04. DOI: 10.1109/TVLSI.2025.3528244

Anil Kali; Samrat L. Sabat; Pramod Kumar Meher

Volume 33, Issue 4, pp. 953-962. Full text: https://ieeexplore.ieee.org/document/10871186/

Citations: 0

Abstract



A Flexible DA-Based Architecture for Computation of Inner Product of Variable Vectors

Computing the inner product of a pair of vectors is indispensable in many applications, including artificial intelligence (AI), machine learning (ML), signal processing, image processing, and communication. The throughput required of inner product computation varies widely across applications, and the throughput delivered must match what each application needs. It is therefore important to design flexible hardware for inner product computation that produces the desired throughput. Distributed arithmetic (DA) is a well-known approach for efficient inner product computation. This article presents an efficient DA-based architecture for computing the inner product of variable vectors, which can be tailored to the throughput requirement of a given application and reused for different inner product lengths. The proposed designs can also be deployed to trade throughput against area and energy consumption. The architecture uses modified Booth encoding (MBE) to reduce the number of partial products, and a novel carry-save accumulator (CSA) to shorten the critical path delay. The proposed designs are synthesized with Cadence Genus using the GPDK 90-nm technology library and placed and routed with Cadence Innovus for different inner product lengths and word lengths. Post-layout results show that, averaged over word lengths 8 and 16 and inner product lengths 8, 16, and 32, the proposed designs save nearly 30% in EPC and 29% in ADP relative to the bit-serial DA-based design.
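The arithmetic ideas named in the abstract can be illustrated with a small software model. The sketch below is not the authors' architecture. It only checks, in Python, three textbook identities that such a design relies on: a bit-serial DA decomposition of a two's-complement inner product, radix-4 modified Booth recoding (which halves the number of partial sums), and generic 3:2 carry-save accumulation (which defers carry propagation to one final addition). All function names are hypothetical, `width` is assumed to be even, and every `x` is assumed to fit in `width` bits.

```python
def da_bit_serial(xs, ws, width):
    """Bit-serial DA model: process one bit-plane of xs per step."""
    acc = 0
    for j in range(width):
        # Partial sum of the w_k whose x_k has bit j set.
        partial = sum(w for x, w in zip(xs, ws) if (x >> j) & 1)
        # In two's complement the MSB plane carries negative weight.
        if j == width - 1:
            acc -= partial << j
        else:
            acc += partial << j
    return acc


def booth_radix4_digits(x, width):
    """Radix-4 Booth digits in {-2,-1,0,1,2}, least significant first.

    Satisfies x == sum(d * 4**i for i, d in enumerate(digits)).
    """
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(0, width, 2):
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def carry_save_add(s, c, x):
    """3:2 compression: returns (s2, c2) with s2 + c2 == s + c + x."""
    s2 = s ^ c ^ x
    c2 = ((s & c) | (s & x) | (c & x)) << 1
    return s2, c2


def da_booth_carry_save(xs, ws, width):
    """Booth-recoded DA model: width/2 steps, carry-save accumulation."""
    rows = [booth_radix4_digits(x, width) for x in xs]
    s, c = 0, 0
    for i in range(width // 2):
        # One partial sum per Booth digit position instead of per bit.
        partial = sum(row[i] * w for row, w in zip(rows, ws))
        s, c = carry_save_add(s, c, partial << (2 * i))
    return s + c  # single carry-propagate addition at the end


if __name__ == "__main__":
    xs, ws = [3, -2, 5, -7], [4, 6, -1, 2]
    reference = sum(x * w for x, w in zip(xs, ws))
    print(reference, da_bit_serial(xs, ws, 4), da_booth_carry_save(xs, ws, 4))
```

With a word length of 8, the bit-serial model takes 8 accumulation steps and the Booth-recoded model takes 4. This is the sense in which recoding reduces the number of partial products. The model says nothing about area, delay, or energy, which depend on the hardware realization.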


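Source journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 6.40

Self-citation rate: 7.10%

Articles published: 187

Review time: 3.6 months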

Journal description: The IEEE Transactions on VLSI Systems is published monthly under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels. The IEEE Transactions on VLSI Systems was founded to address this critical area through a common forum. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems-integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. The coverage of these Transactions thus focuses on VLSI/ULSI microelectronic systems integration.