Title: Flex-PE: Flexible and SIMD Multiprecision Processing Element for AI Workloads
Authors: Mukul Lokhande; Gopal Raut; Santosh Kumar Vishvakarma
Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 6, pp. 1610-1623
Publication date: 2025-04-07
Publication type: Journal Article
DOI: 10.1109/TVLSI.2025.3553069
URL: https://ieeexplore.ieee.org/document/10950103/
Impact factor: 2.8 (JCR Q2, Computer Science, Hardware & Architecture)
Citations: 0
Abstract
The rapid evolution of artificial intelligence (AI) models, from deep neural networks (DNNs) to transformers and large language models (LLMs), demands flexible hardware solutions to meet diverse execution needs across edge and cloud platforms. Existing accelerators lack unified support for multiprecision arithmetic and runtime-configurable activation functions (AFs). This work proposes Flex-PE, a single-instruction, multiple-data (SIMD)-enabled multiprecision processing element that efficiently integrates multiply-and-accumulate operations with configurable AFs (Sigmoid, Tanh, ReLU, and SoftMax) on unified hardware. The proposed design achieves throughput improvements of up to $16\times$ for FxP4, $8\times$ for FxP8, $4\times$ for FxP16, and $1\times$ for FxP32, with maximum hardware efficiency for both iterative and pipelined architectures. An area-efficient iterative Flex-PE-based SIMD systolic array reduces DMA reads by up to $62\times$ for input feature maps and $371\times$ for weight filters in VGG-16, achieving 8.42 GOPS/W energy efficiency with an accuracy loss below 2%. Flex-PE scales from 4-bit edge inference to FxP8/16/32, supporting edge and cloud high-performance computing (HPC) while providing adaptable, high-performance AI hardware with optimal precision, throughput, and energy efficiency.
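As a rough illustration of what SIMD multiprecision multiply-and-accumulate means, the sketch below packs several signed fixed-point lanes into one word and accumulates lane-wise products. It is a software model under stated assumptions, not the Flex-PE datapath: the 32-bit word, the function names (`pack_lanes`, `unpack_lanes`, `simd_mac`), and the lane counts it produces (word width divided by lane width) are all hypothetical and are not meant to reproduce the throughput figures quoted in the abstract.

```python
def unpack_lanes(word, width, word_bits=32):
    """Split an unsigned word into signed two's-complement lanes, lowest lane first."""
    mask = (1 << width) - 1
    lanes = []
    for i in range(word_bits // width):
        raw = (word >> (i * width)) & mask
        # Sign-extend the lane if its top bit is set.
        if raw & (1 << (width - 1)):
            raw -= 1 << width
        lanes.append(raw)
    return lanes


def pack_lanes(values, width):
    """Pack signed integers into one unsigned word, lowest lane first."""
    mask = (1 << width) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * width)
    return word


def simd_mac(acc, a_word, b_word, width, word_bits=32):
    """Lane-wise acc[i] + a[i] * b[i]; accumulators are unbounded Python ints."""
    a = unpack_lanes(a_word, width, word_bits)
    b = unpack_lanes(b_word, width, word_bits)
    return [s + x * y for s, x, y in zip(acc, a, b)]


# Eight 4-bit lanes in a (hypothetical) 32-bit word: one call does eight MACs.
a = pack_lanes([1, -2, 3, -4, 5, -6, 7, -8], width=4)
b = pack_lanes([2, 2, 2, 2, -1, -1, -1, -1], width=4)
acc = simd_mac([0] * 8, a, b, width=4)
```

Calling `simd_mac` with `width=16` or `width=32` on the same word size processes two lanes or one lane per call, which is the general precision-versus-parallelism trade-off that a multiprecision SIMD element exposes.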
About the journal:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.