IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Articles

A Parallel Architecture and Implementation for Near-Lossless Hyperspectral Image Compression Based on CCSDS 123.0-B-2 With Scalable Data-Rate Performance
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-26 DOI: 10.1109/TVLSI.2024.3415505
Panagiotis Chatziantoniou;Antonis Tsigkanos;Dimitris Theodoropoulos;Nektarios Kranitis;Antonis Paschalis
Abstract: Hyperspectral and multispectral imaging maintains a crucial role in remote sensing technology for Earth observation missions. However, the huge volume of produced data requires compression for storage and downlink transmission. In 2019, the Consultative Committee for Space Data Systems (CCSDS) released the CCSDS 123.0-B-2 recommended standard, allowing near-lossless compression, through a closed-loop quantizer, by introducing a Hybrid Entropy Coder option. However, the in-loop quantizer introduced additional data dependencies constituting a throughput performance bottleneck. This contribution addresses the need for high data-rate on-board compression by presenting an efficient parallel architecture and hardware implementation based on CCSDS 123.0-B-2. It bypasses the throughput performance bottleneck with an external, hardware-efficient quantizer while maintaining competitive quality near-lossless functionality with compatibility to the CCSDS standard. The parallel architecture leverages segmentation along the X-axis of the spectral cube, enabling scalable data-rate performance with constant embedded memory footprint. The introduced architecture is implemented in VHSIC hardware description language (VHDL) indicatively targeting Xilinx Kintex UltraScale technology, validated and demonstrated using state-of-the-art SpaceFibre serial link interface IP Cores and test equipment, achieving very high code coverage. A single hyperspectral compression engine (HCE) achieves throughput performance of 285 MSamples/s (4.56 Gb/s) at 1.68 W, while six parallel HCEs reach 1590 MSamples/s (25.44 Gb/s) at 6.12 W, measured on a full breadboard system. Maximum performance only depends on image dimensions, available field programmable gate array (FPGA) resources and high-speed serial interface technology. To the best of our knowledge, this implementation achieves the highest data-rate performance for near-lossless compression based on CCSDS 123.0-B-2 implemented in FPGA technology suitable for next-generation institutional missions.
Citations: 0
An Injection-Locked and Sub-Sampling Clock Multiplier With a Two-Step SC DAC Achieving 2.67% Jitter Variation
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-26 DOI: 10.1109/TVLSI.2024.3417015
Qifeng Huang;Siji Huang;Yanhang Chen;Yifei Fan;Jie Yuan
Abstract: This article presents an injection-locked clock multiplier (ILCM) using a digitally controlled frequency-tracking loop (FTL) with an integral two-step switched-capacitor (SC) digital-to-analog converter (DAC). Conventionally, the DAC resolution needs to be increased for low noise at the cost of degraded monotonicity due to device mismatch. To overcome this tradeoff, the proposed DAC utilizes the SC technique to achieve fine steps. With only two capacitors involved in charge transfer, the DAC is inherently monotonic, avoiding the boundary-crossing issue and the mismatch calibration. A control-voltage-tracking loop (CVTL) further suppresses the quantization noise by balancing the up and down step sizes and helps achieve a 16-bit-level voltage step. The FTL is sub-sampling and utilizes a bang-bang phase detector (BBPD). Locking at 700 MHz, the ILCM achieves a 0.9-ps integrated jitter, a -125-dBc/Hz phase noise at a 1-MHz offset, and a small jitter variation of 2.67% under different supply voltages and temperatures. With FTL, the spur is around -56 dBc from the prototype fabricated in a 180-nm CMOS process. The chip occupies a core area of 0.054 mm² and consumes 689 μW from a 1.8-V supply, achieving an FoM of -242.5 dB.
Citations: 0
Deep Reinforcement Learning-Based Power Management for Chiplet-Based Multicore Systems
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-26 DOI: 10.1109/TVLSI.2024.3415487
Xiao Li;Lin Chen;Shixi Chen;Fan Jiang;Chengeng Li;Wei Zhang;Jiang Xu
Abstract: Chiplet technology has emerged as a promising solution to address the increasing demand for high-performance computing in light of the slowdown of Moore’s law. While chiplet-based multicore systems offer higher performance through heterogeneous integration, they also pose challenges for power delivery system (PDS) design. The integration of additional vertical and inter-chiplet connections, along with higher power density, impose stringent requirements on power delivery. Moreover, PDS efficiency is affected by workload variations at runtime, necessitating the need to design and manage PDSs and processors as a whole to improve system energy efficiency while balancing performance. In this article, we propose an offline-online co-design optimization methodology that combines offline PDS design optimization with online power management. To address the power consumption and delivery mismatch, we introduce a centralized deep Q-network (DQN)-based online control scheme for power co-management in chiplet-based multicore systems. By carefully designing the state space and reward functions, our approach achieves workload-aware adaptive control to reduce the energy-delay-product (EDP) while maintaining PDS efficiency under a given performance target (PT). We conduct evaluations on realistic applications to validate the effectiveness of our approach. For 64-core systems, our method achieves an average EDP reduction of 67% while meeting a 90% PT, surpassing state-of-the-art modular Q-learning (MQL)-based and heuristic-based approaches by up to 4% and 16%, respectively. Additionally, our approach demonstrates wiser action selection policies, higher control stability, and lower implementation overhead compared to the MQL-based approach.
Citations: 0
ABS: Accumulation Bit-Width Scaling Method for Designing Low-Precision Tensor Core
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-25 DOI: 10.1109/TVLSI.2024.3414260
Yasong Cao;Mei Wen;Zhongdi Luo;Xin Ju;Haolan Huang;Junzhong Shen;Haiyan Chen
Abstract: A big gap exists between deep neural network (DNN) applications’ computational demand and the computing power of DNN accelerators. Low-precision floating-point (LP-FP) computation is one of the important means to improve the performance of DNN training and inference. However, the high-precision accumulators are typically applied to summating the dot products during general matrix multiplication (GEMM) in tensor cores (TCs). As the precision of data decreases, the accumulator becomes the main consumer of multiply-accumulate’s (MAC’s) area and power. Reducing the accumulators’ bit-width is of significant importance for improving the area- and energy-efficiency of TCs. There are two main challenges: 1) theoretical support on the floating-point (FP) formats with the lowest bit-width of TC’s accumulators and 2) how to integrate the LP-FP TC in the framework of DNN training and inference to evaluate its benefits. In this article, we propose accumulation bit-width scaling (ABS), a novel ABS method, to guide the design of LP-FP TCs. We 1) implement this method by constructing a novel variance retention ratio (VRR) model to predict the FP format with the minimum bit-width for TC’s accumulator; 2) provide a generator of DNN accelerator based on a systolic-array (SA) TC, supporting many low-precision configurations; and 3) design an LP-FP DNN executing framework that supports software-simulation mode and hardware-accelerator mode to run LP-FP DNN tasks. The experimental results show that the LP-FP TC guided by our ABS method has a maximum reduction of 76.47% and 75.60% in area and power consumption, respectively, compared with the advanced TCs.
Citations: 0
A 28-nm Dual-Mode Explicit Class-F₂₃ VCO With Low-Loss CM Return Path Achieving 70–400-kHz 1/f³ PN Corner Over 4.9–7.3-GHz TR
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-25 DOI: 10.1109/TVLSI.2024.3414158
Shan Lu;Danyu Wu;Xuan Guo;Hanbo Jia;Yong Chen;Xinyu Liu
Abstract: This brief presents an explicit Class-F₂₃ voltage-controlled oscillator (VCO). The square-like voltage waveform is obtained via waveform shaping, and flicker noise upconversion is suppressed by a proper common-mode (CM) return path. CM resonance at the second harmonic frequency is introduced by a compact octagonal inductor. The rms value of the impulse sensitivity function (ISF) is significantly reduced through Class-F₂₃ operation. The VCO switches between two modes of a high-order LC resonator consisting of two identical LC tanks coupled by capacitors. A prototype of the VCO is implemented in a 28-nm CMOS. Measurements show a continuous tuning range (TR) of 4.89–7.29 GHz, with a peak figure of merit (FoM) of 190.5 dB/Hz at 5.8 GHz and better than 188.5 dB across the entire TR. The flicker phase-noise corner ranges from 70 to 400 kHz. The VCO consumes 16–19 mW from a 0.5-V supply and occupies an active area of 0.21 mm².
Citations: 0
1.63 pJ/SOP Neuromorphic Processor With Integrated Partial Sum Routers for In-Network Computing
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-24 DOI: 10.1109/TVLSI.2024.3409652
Dongrui Li;Ming Ming Wong;Yi Sheng Chong;Jun Zhou;Mohit Upadhyay;Ananta Balaji;Aarthy Mani;Weng Fai Wong;Li Shiuan Peh;Anh Tuan Do;Bo Wang
Abstract: Neuromorphic computing is promising to achieve unprecedented energy efficiency by emulating the human brain’s mechanism. Conventional neuromorphic accelerators employ split-and-merge method to map spiking neural networks’ inputs to surpass the fan-in capabilities of a single neuron core. However, this approach gives rise to the risk of accuracy compromise and extra core usage for the merging process. Moreover, it requires excessive data movement and clock cycles to aggregate spikes generated by partial sums instead of total sums obtained from different cores with substantial power and energy overhead. This work presents a novel approach to addressing the challenges imposed by the split-and-merge method. We propose an energy-efficient, reconfigurable neuromorphic processor that leverages several key techniques to mitigate the above issues. First, we introduce a partial sum router circuitry that enables in-network computing (INC), eliminating the need for extra merge cores. Second, we adopt software-defined Networks-on-Chip (NoCs) by leveraging predefined, efficient routing, eliminating power-hungry routing computation. At last, we incorporate fine-grained power gating and clock gating techniques for further power reduction. Experimental results from our test chip demonstrate the lossless mapping of the algorithm and exceptional energy efficiency, achieving an energy consumption of 1.63 pJ/SOP at 0.48 V. This energy efficiency represents a 22.4% improvement compared to the state-of-the-art results. Our proposed neuromorphic processor provides an efficient and flexible solution for neural network processing, mitigating the limitations of the traditional split-and-merge approach while delivering superior energy efficiency.
Citations: 0
A 206 μW Vital Signs Monitoring System on Chip for Measuring Five Vitals
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-24 DOI: 10.1109/TVLSI.2024.3415469
Sameen Minto;Austin Cable;Wala Saadeh
Abstract: This article presents an area and power-efficient system-on-chip (SoC) for vital signs monitoring to provide patients with remote monitoring. It measures five important vitals including blood oxygen saturation (SpO2), respiration rate (RR), heart rate (HR), HR variability (HRV), and temperature. The proposed SoC utilizes a photoplethysmography (PPG) signal to compute HR, HRV, SpO2, and RR. The PPG signal is amplified and filtered using a PPG readout that includes a transimpedance amplifier (TIA) with a switched integrator (SI) to filter and amplify the signal. A differential second-order, delta-sigma analog-to-digital converter (ΔΣ-ADC) is adopted to digitize the PPG signal. The SoC also comprises a low-power LED driver for both red and infrared (IR) LEDs which operate in pulsed mode with a 0.625% duty cycle. A vital signs extractor performs feature extraction (FE) and computes the vital signs with a maximum absolute error of less than 1%. In this work, the temperature is also measured by employing a Wheatstone bridge (WhB)-based temperature sensor which integrates thermal resistors into a second-order ΔΣ-ADC. The proposed system shares ΔΣ-ADC for digitizing the PPG signal and the temperature readings to reduce both area and power consumption. The proposed system computes the temperature over the human’s temperature range (32 °C to 42 °C) with an accuracy of ±0.09 °C. The SoC is implemented using a 180 nm CMOS process with an area of 4.8 mm² while consuming 206 μW.
Citations: 0
VLSI Design of Light-Field Factorization for Dual-Layer Factored Display
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-24 DOI: 10.1109/TVLSI.2024.3414262
Li-De Chen;Li-Qun Weng;Hao-Chien Cheng;An-Yu Cheng;Kai-Ping Lin;Chao-Tsung Huang
Abstract: This article introduces a VLSI design for light-field factorization, aimed at enhancing immersive 3-D visual experiences for computational light-field factored displays. The main design challenges are intensive memory-access demands and high computational complexity. Accordingly, we first propose half-block-based factorization (HBBF) and sparse ray sampling (SRS) to reduce DRAM bandwidth by 99% and SRAM size by 74%. Then, we devise integer hybrid quantization (INTH) to cut down computational logic by 41%, leading to improvements in die area and power efficiency. Finally, we fabricated a processor chip that incorporates 75.1 kB of SRAM and 5.9M logic gates using 40-nm CMOS technology. It can operate with three different performance modes: high quality (56.9 MPixel/s at 971 mW), balanced (62.5 MPixel/s at 442 mW), and low power (61.7 MPixel/s at 283 mW). Across these modes, its normalized energy ranges between 4.4 and 16.2 nJ/pixel. This implementation surpasses existing GPU platforms and offers an 85× increase in processing speed and a 311× reduction in power consumption. We also showcase a real-time computational 3-D display system with this chip, demonstrating its practical efficacy in computational 3-D display technology.
Citations: 0
Enabling Efficient Hybrid Systolic Computation in Shared-L1-Memory Manycore Clusters
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-24 DOI: 10.1109/TVLSI.2024.3415486
Sergio Mazzola;Samuel Riedel;Luca Benini
Abstract: Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid architectures and complex programming models, the second are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient RISC-V cores act as the systolic array’s processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster’s shared memory. We introduce two low-overhead RISC-V instruction set architecture (ISA) extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture’s trade-offs on several digital signal processing (DSP) kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool’s compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80 V/25 °C), in a 22-nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
Citations: 0
Analysis and Optimization of Sense-and-Set Piezoelectric Energy Harvesting Interface Circuits
IF 2.8, Zone 2 (Engineering Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-06-20 DOI: 10.1109/TVLSI.2024.3409668
Loai G. Salem
Abstract: This article presents the modeling and optimization of a sense-and-set (SaS) rectifier. The basic equations governing the operation of a SaS rectifier are derived analytically using Laplace-transform techniques. An expression for the harvesting efficiency of a SaS rectifier is developed by evaluating the conduction and gate-drive losses as well as the output power of the rectifier. The derived expressions are then employed to locate the optimal design point of a SaS interface circuit. The proposed modeling approach reduces the required run time by more than 2000 times as compared to SPICE simulation without sacrificing accuracy. The following design parameters are determined for maximum efficiency: optimal relative size between the rectifier switches, total conductance of the rectifier, and sensing frequency. The close match between the theoretical expressions and circuit simulation results validates the proposed analysis.
Citations: 0