{"title":"FastGW: A Machine Learning-Based Early Skip for the AV1 Global Warped Motion Compensation","authors":"William Kolodziejski;Robson Domanski;Luciano Agostini","doi":"10.1109/TCSI.2024.3486243","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3486243","url":null,"abstract":"The growing consumption of digital media, driven by technological advancements and exacerbated by the COVID-19 pandemic, has led to an increased demand for efficient video compression techniques. Among the various video encoders available, the AOMedia Video 1 (AV1) stands out since it was defined by the Alliance for Open Media (AOMedia), which is formed by big techs such as Google, Amazon, NetFlix, Meta, and Intel, among others. AV1 was launched in 2018 and it reaches high compression rates, especially for high-resolution videos. However, AV1 computational cost is significantly higher when compared to other current codecs. This paper is focused on one of the main novelties introduced by AV1: the Global Warped Motion Compensation (GWMC) tool. A computational effort reduction approach called Fast Global Warped (FastGW), using machine learning, is proposed to reduce the GWMC processing time. Then, a decision tree was trained to decide whether to skip the GWMC’s most computationally intensive step: the Refinement. This decision tree was implemented inside the AV1 encoder, resulting in an average time reduction of 23% at the GWMC, with a minimal impact on coding efficiency of 0.14% in BD-BR on average. 
To the best of the authors’ knowledge, this is the first work in the literature exploring machine learning to reduce the AV1 GWMC computational effort.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 3","pages":"977-988"},"PeriodicalIF":5.2,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Noise-Tolerant Proximal Neurodynamic Algorithm for Solving MVIPs in Fixed-Time With Circuit Implementations and Applications","authors":"Shan Jiang;Ben Niu;Xingxing Ju;Hongyu Ma","doi":"10.1109/TCSI.2024.3488858","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3488858","url":null,"abstract":"In this paper, we propose a new noise-tolerant neurodynamic algorithm with fixed-time convergence to solve mixed variational inequality problems (MVIPs) and design the circuit framework for its hardware implementation. We prove that the proposed neurodynamic algorithm converges to a unique solution within fixed-time under some conditions and give its convergence time upper bound, which is independent of the initial states. Meanwhile, the robustness of the neurodynamic algorithm under additive perturbations is also demonstrated. In addition, the proposed neurodynamic algorithm is implemented using numerical simulation, analog circuits, and field-programmable gate array (FPGA) respectively. Finally, the superiority of the proposed algorithm is verified by two applications of image reconstruction and elastic net logistic regression.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 3","pages":"1462-1471"},"PeriodicalIF":5.2,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bearing-Based Adaptive Cooperative Elliptical Circumnavigation Control for Multi-Agent Systems","authors":"Hongyu Ji;Xiang Li","doi":"10.1109/TCSI.2024.3484768","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3484768","url":null,"abstract":"This paper investigates the bearing-based cooperative circumnavigation control problems for multi-agent systems to enclose multiple static and moving targets on adaptively designed elliptical orbits. Firstly, a novel method based on estimated relative positions of agents with respect to targets is proposed to adaptively design the elliptical orbits to improve the adaptability of circumnavigation. The relative positions between agents and targets are estimated using only bearing measurements. A bearing-only circumnavigation control law is then proposed to enable a single agent to elliptically circumnavigate multiple targets on the adaptively designed orbit. For a group of agents, the control law for a single agent is extended by incorporating affine formation and collision avoidance. Furthermore, the effectiveness of the proposed orbit design method for enclosing targets and the convergence of the elliptical circumnavigation control laws are analyzed theoretically with their advantages verified by extensive simulations.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 4","pages":"1787-1799"},"PeriodicalIF":5.2,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143726356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Battery-Free Hybrid Ambient RF and Wind Energy Harvester for Outdoor IoTs","authors":"Shao Fei Bo;Jun-Hui Ou;Pei Ming Wang;Huaiguang Jiang;Xiu Yin Zhang","doi":"10.1109/TCSI.2024.3487262","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3487262","url":null,"abstract":"The paper proposes a hybrid RF and wind energy harvester. It is constructed by structural and functional integration of the two dissimilar energy harvesting techniques, constituting a conformal design. The rectifying efficiency can be boosted by the hybrid power source excitation, thereby increasing DC output power compared to standalone power source. A fan-shaped omnidirectional antenna and a hybrid single shunt-diode rectifier are designed to realize energy receiving and rectifying, respectively. A prototype is implemented and measured. The receiving part can achieve 2.29-dBi peak gain and 0.92-dB non-roundness at 1.85 GHz. It can also work smoothly when the wind speed varies from 0 to 12 m/s. The RF-DC conversion efficiency at -20 dBm and AC-DC output voltage at 12 m/s are measured as 20% and 79 mV, respectively. When both RF and wind power sources are accessible, the hybrid DC output power of 1.0 uW can be obtained with -30-dBm RF power and 10m/s wind speed. Moreover, the output power at hybrid rectifying mode is higher than that of simply superimposed RF and wind energy. Efficiency gain of up to 182% can be achieved. 
The hybrid energy harvester is a good candidate to power the sensors in battery-free IoTs.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 3","pages":"1218-1228"},"PeriodicalIF":5.2,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer","authors":"Huihong Shi;Haikuo Shao;Wendong Mao;Zhongfeng Wang","doi":"10.1109/TCSI.2024.3485192","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3485192","url":null,"abstract":"Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs’ deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly Softmax, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with standard ViTs, we focus our attention towards the quantization and acceleration for efficient ViTs, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingly. Specifically, at the algorithm level, we develop a tailored post-training quantization engine taking the unique activation distributions of Softmax-free efficient ViTs into full consideration, aiming to boost quantization accuracy. Furthermore, at the hardware level, we build an accelerator dedicated to the specific Convolution-Transformer hybrid architecture of efficient ViTs, thereby enhancing hardware efficiency. Extensive experimental results consistently prove the effectiveness of our Trio-ViT framework. 
Particularly, we can gain up to <inline-formula> <tex-math>$\\uparrow {3.6}\\times $ </tex-math></inline-formula>, <inline-formula> <tex-math>$\\uparrow {5.0}\\times $ </tex-math></inline-formula>, and <inline-formula> <tex-math>$\\uparrow {7.3}\\times $ </tex-math></inline-formula> FPS under comparable accuracy over state-of-the-art ViT accelerators, as well as <inline-formula> <tex-math>$\\uparrow {6.0}\\times $ </tex-math></inline-formula>, <inline-formula> <tex-math>$\\uparrow {1.5}\\times $ </tex-math></inline-formula>, and <inline-formula> <tex-math>$\\uparrow {2.1}\\times $ </tex-math></inline-formula> DSP efficiency. Codes are available at <uri>https://github.com/shihuihong214/Trio-ViT</uri>.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 3","pages":"1296-1307"},"PeriodicalIF":5.2,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Reconfigurable and Area-Efficient Polynomial Multiplier Using a Novel In-Place Constant-Geometry NTT/INTT and Conflict-Free Memory Mapping Scheme","authors":"Jianfei Wang;Chen Yang;Yishuo Meng;Fahong Zhang;Jia Hou;Siwei Xiang;Yang Su","doi":"10.1109/TCSI.2024.3483229","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3483229","url":null,"abstract":"Out-of-place constant-geometry (CG) NTT usually has a simple and uniform memory access pattern. However, out-of-place CG NTT always requires ping-pong memory, resulting in a memory capacity requirement of <inline-formula> <tex-math>$2N$ </tex-math></inline-formula>. Therefore, we propose a novel radix-4 in-place CG (IPCG) NTT/INTT that reduces the capacity requirement from <inline-formula> <tex-math>$2N$ </tex-math></inline-formula> to N. An area-efficient and dynamical reconfigurable polynomial multiplier (RAEPM) based on IPCG NTT is proposed to speed up polynomial multiplication over rings. In RAEPM, a Barrett modular multiplier using area-efficient radix-4 booth multiplier is designed to reduce area. In addition, an odd-bank buffer structure is proposed to achieve conflict-free memory mapping independent of polynomial length N and NTT/INTT stage. Moreover, we also proposed an efficient modular reduction for specific numbers and introduced a division equivalent method to eliminate the odd number modular reduction and odd number division in addressing. RAEPM is implemented on Xilinx VC709 FPGA and runs at 294MHz clock frequency. Compared with the prior pure NTT accelerators, under the same parameters, RAEPM achieves a decrease of 39.02% <inline-formula> <tex-math>$\\sim ~57.63$ </tex-math></inline-formula>% in area-time complexity of equivalent LUT, and a decrease of 15.97% <inline-formula> <tex-math>$\\sim ~49.24$ </tex-math></inline-formula>% in area-time complexity of equivalent FF. 
Compared with the prior NTT-based polynomial multipliers, under the same parameters, RAEPM achieves a decrease of 35.48% <inline-formula> <tex-math>$\\sim ~90.81$ </tex-math></inline-formula>% in area-time complexity of equivalent LUT, and a decrease of 24.24% <inline-formula> <tex-math>$\\sim ~88.41$ </tex-math></inline-formula>% in area-time complexity of equivalent FF.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 3","pages":"1358-1371"},"PeriodicalIF":5.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Critical-Set-Based Multi-Bit Successive Cancellation List Decoder for Polar Codes: Algorithm and Implementation","authors":"Shan Cao;Shan Chen;Limin Jiang;Zhiyuan Jiang","doi":"10.1109/TCSI.2024.3485634","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3485634","url":null,"abstract":"With the evolution of wireless communication systems, there is a growing demand for high reliability and low latency in channel coding, particularly in 5G and beyond wireless systems used in applications such as autonomous driving and remote medical services. For the decoding of polar codes, the multi-bit successive cancellation list (MSCL) decoding technique was recently introduced to decrease the decoding latency by decoding several short inner codes in parallel, which preserves high reliability compared to the conventional successive cancellation list (SCL) decoding. However, as parallelism increases, the complexity of the decoding path sorting also increases significantly, which makes it resource-intensive for hardware implementation. To address this issue, this paper proposes a configurable critical-set-based multi-bit successive cancellation list (CS-MSCL) decoding algorithm, which first introduces critical sets to the MSCL decoding for the optimization of path pruning. Subsequently, an enhanced CS-MSCL algorithm is introduced for large list-size MSCL decoding, which can boost the error correction performance. Then, an area-efficient decoding architecture is introduced, which supports the cyclic redundancy check (CRC) and the CS-MSCL decoding compatible with the 5G standard. The proposed decoder is implemented in SMIC 40 nm CMOS technology with a parallelism degree of 8, which has a peak area efficiency of <inline-formula> <tex-math>$4.64~\\mathrm {Gbps/mm^{2}}$ </tex-math></inline-formula> for list size 4 and <inline-formula> <tex-math>$2.01~\\mathrm {Gbps/mm^{2}}$ </tex-math></inline-formula> for list size 8. 
Compared to state-of-the-art SCL-based decoders, the normalized area efficiency is improved by 7.16% and 17.54% for list sizes 4 and 8, respectively.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 3","pages":"1472-1485"},"PeriodicalIF":5.2,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memristor-Based Neural Network Circuit of Full-Function Pavlov Associative Memory With Unconditioned Response Mechanisms","authors":"Zhixia Ding;Zhirui Chen;Sai Li;Zicheng Li;Le Yang","doi":"10.1109/TCSI.2024.3485163","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3485163","url":null,"abstract":"Most memristor-based Pavlovian associative memory neural networks explore the impact of relearning and forgetting on associative memory rates. In this paper, a memristor-based neural network circuit with unconditioned response mechanisms is proposed. The circuit is composed of neuron modules, synapse modules, forgetting voltage control modules, and effective time judgment modules. The proposed circuit achieves control of the forgetting rate, which can be adjusted based on the interval of the unconditioned stimuli. Additionally, the concept of effective time judgment has been introduced. If the stimulus interval exceeds the range of effective time, the circuit will automatically proceed with natural forgetting. Furthermore, the synaptic weight between the food neuron and the salivation neuron is no longer fixed but can change based on the method of food presentation. When only looking at the food but not eating it, the connection between the visual neuron and the salivation neuron decreases, resulting in reduced salivation when seeing the food. However, after eating the food, the connection between the visual neuron and the salivation neuron quickly strengthens. This circuit further refines the functionality of neural network circuits based on the mechanisms of unconditional reflexes. 
Finally, these functions are validated using PSPICE.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 4","pages":"1574-1586"},"PeriodicalIF":5.2,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143726572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Incremental Time-Domain Mixed-Signal Matrix-Vector-Multiplication Technique for Low-Power Edge-AI","authors":"Kévin Hérissé;Benoit Larras;Bruno Stefanelli;Andreas Kaiser;Antoine Frappé","doi":"10.1109/TCSI.2024.3480154","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3480154","url":null,"abstract":"This paper proposes a time-domain mixed-signal computing architecture for Matrix-Vector Multiplication suited for embedded in-memory computing applications. The system leverages the low data rate of sensors’ data in embedded AI applications to target an energy-efficient implementation of the matrix-vector multiplication array. The mixed-signal computing scheme relies on incremental time-domain multiply-and-accumulate operations using switched current sources. The concept is demonstrated on a 28nm FDSOI prototype chip of a 100<inline-formula> <tex-math>$\\times $ </tex-math></inline-formula>4 compute array that shows a 15.8TOPS/W energy efficiency for 5-bit MAC operations. Extrapolating the array to 100<inline-formula> <tex-math>$\\times $ </tex-math></inline-formula>100 computing units leads to a 99.2TOPS/W energy efficiency.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"71 12","pages":"6470-6481"},"PeriodicalIF":5.2,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142713972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Eyelet: A Cross-Mesh NoC-Based Fine-Grained Sparse CNN Accelerator for Spatio-Temporal Parallel Computing Optimization","authors":"Bowen Yao;Liansheng Liu;Yu Peng;Xiyuan Peng;Rong Xu;Heming Liu;Haijun Zhang","doi":"10.1109/TCSI.2024.3483308","DOIUrl":"https://doi.org/10.1109/TCSI.2024.3483308","url":null,"abstract":"Fine-grained sparse convolutional neural networks (CNNs) achieve a better trade-off between model accuracy and size than coarse-grained sparse CNNs. Due to irregular data structures and unbalanced computation loads, fine-grained sparse CNNs struggle to fully leverage the performance advantages of computation and storage on general-purpose edge hardware. However, existing custom sparse accelerators are designed from the perspective of emulating a balanced load by software or computational strategies, neglecting the exploration of the computing architecture’s adaptability and parallelism for fine-grained sparse models. To address these challenges, a cross-mesh NoC-based accelerator architecture is proposed. This architecture aligns with the irregular characteristics of fine-grained sparse CNN weights and enhances the spatio-temporal parallelism of fine-grained sparse CNNs. First, a sparse multiplier unit (SMU) array and an adder array are designed to enable parallel execution of convolution multiplication and accumulation operations. Then, element-wise unroll-based nonzero weight multiplication is mapped to the SMU array to provide more flexible spatial parallelism. A horizontal and vertical cross-mesh NoC is proposed for flexible dataflow scheduling between the SMU and adder arrays to further improve temporal parallelism. This architecture allows the multiplication and accumulation operations in convolution to be decoupled and pipelined with negligible latency. Finally, the proposed accelerator architecture is implemented on the ZU9EG platform. 
The experimental results show that the proposed accelerator achieves frame rates of 509.9, 249.3, 100.7, 48.4, and 168.9 frames per second (FPS) for AlexNet, VGG-16, ResNet-18, MobileNet-v2, and EfficientNet, respectively. Compared with related works, this accelerator achieves inference speed and energy efficiency improvements of <inline-formula> <tex-math>$1.1\\times \\sim 36.1\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$2.4\\times \\sim 13.4\\times $ </tex-math></inline-formula>, respectively.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 4","pages":"1634-1647"},"PeriodicalIF":5.2,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143726564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}