IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Articles

A Study on Nonlinearity in Mixers Using a Time-Varying Volterra-Based Distortion Contribution Analysis Tool
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-17 DOI: 10.1109/TVLSI.2024.3474183
Negar Shabanzadeh;Aarno Pärssinen;Timo Rahkonen
Abstract: To optimize the linearity of mixers, one needs to identify the origins and mixing mechanisms of the dominant nonlinearities. This article presents a time-varying (TV) nonlinear analysis in which TV polynomial models are used to derive harmonic mixing functions for four simple mixer structures. Local oscillator (LO) waveform engineering is used to tailor the nonlinearity coefficients, which are then fed into a numerical distortion contribution analysis tool that calculates the nonlinearity using a Volterra series-based approach. The spectra of the nonlinear voltages are plotted, followed by plots that break the final third-order intermodulation (IM3) distortion down into its contributions. Finally, the effect of filtering on distortion is illustrated.
Vol. 32, No. 12, pp. 2232-2242. Full text: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10721238
Citations: 0
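As a simplified illustration of what an IM3 contribution is, the sketch below drives a memoryless, time-invariant cubic polynomial y = a1·x + a3·x³ with two equal tones and measures the product at 2f1 − f2 with a single-bin DFT. It is not the authors' time-varying Volterra tool: in the paper the coefficients vary with the LO waveform, whereas here a1, a3, the tone amplitude, and the bin choices are hypothetical constants. For tone amplitude A, theory predicts an IM3 amplitude of (3/4)·|a3|·A³.

```python
import cmath
import math

def bin_amplitude(samples, k):
    """Amplitude of the real-signal component at DFT bin k (0 < k < n/2)."""
    n = len(samples)
    acc = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2 * abs(acc) / n

def two_tone_im3(a1, a3, amp, k1, k2, n=256):
    """Drive y = a1*x + a3*x**3 with two equal tones on DFT bins k1 and k2.
    Returns (amplitude at k1, amplitude of the IM3 product at 2*k1 - k2)."""
    x = [amp * (math.cos(2 * math.pi * k1 * i / n) + math.cos(2 * math.pi * k2 * i / n))
         for i in range(n)]
    y = [a1 * v + a3 * v ** 3 for v in x]
    return bin_amplitude(y, k1), bin_amplitude(y, 2 * k1 - k2)

# Hypothetical coefficients: unit gain with a small compressive cubic term.
fund, im3 = two_tone_im3(a1=1.0, a3=-0.1, amp=0.1, k1=20, k2=22)
print(f"IM3 = {im3:.3e} (theory {0.75 * 0.1 * 0.1 ** 3:.3e}), "
      f"{20 * math.log10(im3 / fund):.1f} dBc")
```

The tones sit exactly on DFT bins so there is no spectral leakage, and all mixing products stay below n/2, so none alias onto the IM3 bin.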
Bandwidth-Latency-Thermal Co-Optimization of Interconnect-Dominated Many-Core 3D-IC
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-16 DOI: 10.1109/TVLSI.2024.3467148
Sudipta Das;Samuel Riedel;Mohamed Naeim;Moritz Brunion;Marco Bertuletti;Luca Benini;Julien Ryckaert;James Myers;Dwaipayan Biswas;Dragomir Milojevic
Abstract: The ongoing integration of advanced functionality into contemporary systems-on-chip (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified by the advance of artificial intelligence (AI), which demands higher memory and interconnect bandwidth and lower latency. This article presents a comprehensive study of architectural modifications to an interconnect-dominated many-core SoC, targeting a significant increase in intermediate on-chip cache memory bandwidth and tuning of access latency. The proposed SoC has been implemented in 3-D using A10 nanosheet technology, and an early thermal analysis has been performed. Our workload simulations reveal up to 12-fold and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC, respectively. When implemented in 2-D, this speed-up comes at the cost of a 40% increase in die area and a 60% rise in power dissipation. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to exploit the potential of 3-D technology for lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in high-performance computing (HPC) and mobile applications. Our analysis reveals that, unlike the high-power-density HPC cases, the 3-D mobile case increases T_max by only 2 °C to 3 °C compared to 2-D, whereas the HPC scenario requires efficient multiconstrained partitioning for 3-D implementations.
Vol. 33, No. 2, pp. 346-357.
Citations: 0
A Real-Time Rotation Calibration for Interchannel Offset Mismatch in Time-Interleaved SAR ADCs
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-16 DOI: 10.1109/TVLSI.2024.3472095
Yixiao Luo;Hongzhi Liang;Zeyu Peng;Yukui Yu;Shubin Liu;Ruixue Ding;Zhangming Zhu
Abstract: This brief presents an on-chip, real-time rotation calibration (RRC) technique that alleviates interchannel offset mismatch in time-interleaved (TI) successive-approximation-register (SAR) analog-to-digital converters (ADCs). By leveraging auto-rotation calibration and self-compensation strategies in the analog domain, the proposed technique demonstrates robust performance across PVT variations. Two additional sub-channels are included in the TI quantization mechanism, and the continuous rotation of the sampling-clock distribution ensures their operation in calibration mode. To validate the proposed calibration, an 8×8-bit, 8-GS/s TI-SAR ADC is designed and implemented in a 28-nm process; it occupies an active area of 0.273 mm², with each sub-channel SAR ADC measuring only 86 × 23 μm. Extensive simulation results validate the efficacy of RRC and show significant improvements in dynamic performance: at the Nyquist input frequency, SNDR increases from 37.1 to 45.4 dB and SFDR rises from 57.8 to 60.7 dB.
Vol. 33, No. 3, pp. 897-901.
Citations: 0
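The brief's RRC works in the analog domain by rotating spare sub-channels into calibration. The sketch below instead shows the generic digital view of the problem it solves: in an M-channel TI ADC, per-channel offsets form a pattern that repeats every M samples, which produces spurs at multiples of fs/M regardless of the input frequency. The offset values, record length, and the per-channel-mean estimator are hypothetical choices for illustration, not the authors' method.

```python
import cmath
import math

def bin_amplitude(samples, k):
    """Amplitude of the real-signal component at DFT bin k (0 < k < n/2)."""
    n = len(samples)
    acc = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2 * abs(acc) / n

def interleave(signal, offsets):
    """Model an M-channel TI ADC: sample i is taken by channel i % M, which
    adds that channel's DC offset (gain, timing, and quantization ignored)."""
    m = len(offsets)
    return [s + offsets[i % m] for i, s in enumerate(signal)]

def estimate_offsets(samples, m):
    """Per-channel mean as an offset estimate; only valid when the input
    itself averages to zero on every channel over the record."""
    return [sum(samples[c::m]) / len(samples[c::m]) for c in range(m)]

n, m = 512, 8
offsets = [0.01, -0.005, 0.0, 0.003, -0.008, 0.006, -0.002, 0.004]  # hypothetical, full-scale units
signal = [0.5 * math.sin(2 * math.pi * 37 * i / n) for i in range(n)]
raw = interleave(signal, offsets)
est = estimate_offsets(raw, m)
corrected = [x - est[i % m] for i, x in enumerate(raw)]

spur_bin = n // m  # first offset-mismatch spur, at fs/M
print("spur before:", bin_amplitude(raw, spur_bin))
print("spur after: ", bin_amplitude(corrected, spur_bin))
```

The input bin (37) is chosen coprime to the per-channel record length so that the sine averages to exactly zero on each channel; with arbitrary inputs a simple mean estimate would be biased.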
Edge PoolFormer: Modeling and Training of PoolFormer Network on RRAM Crossbar for Edge-AI Applications
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-15 DOI: 10.1109/TVLSI.2024.3472270
Tiancheng Cao;Weihao Yu;Yuan Gao;Chen Liu;Tantan Zhang;Shuicheng Yan;Wang Ling Goh
Abstract: PoolFormer is a variant of the Transformer neural network whose key difference is that it replaces the computationally demanding token mixer with a pooling function. In this work, a memristor-based PoolFormer network modeling and training framework for edge artificial intelligence (AI) applications is presented. The original PoolFormer structure is further optimized for hardware implementation on an RRAM crossbar by replacing the normalization operation with scaling. In addition, the nonidealities of the RRAM crossbar, from device to array level, as well as those of the peripheral readout circuits, are analyzed. By integrating these factors into one training framework, the overall neural network performance is evaluated holistically, and the impact of the nonidealities on network performance can be effectively mitigated. Implemented in Python and PyTorch, a 16-block PoolFormer network is built using a 64 × 64 four-level RRAM crossbar array model extracted from measurement results. The proposed Edge PoolFormer network has 0.246 M parameters in total, at least an order of magnitude fewer than a conventional CNN implementation. The network achieves an inference accuracy of 88.07% on the CIFAR-10 image classification task, an accuracy degradation of 1.5% compared with the ideal software model using FP32 weights.
Vol. 33, No. 2, pp. 384-394.
Citations: 0
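To make "replacing the token mixer with a pooling function" concrete, here is a minimal single-channel sketch of a pooling token mixer as we understand the original PoolFormer design: stride-1 average pooling over a small window (padded positions excluded from the average), minus the input itself. It uses plain Python lists rather than PyTorch, and it does not model the Edge PoolFormer changes (scaling instead of normalization, RRAM nonidealities); pool_size=3 is an assumed default.

```python
def pool_token_mixer(x, pool_size=3):
    """PoolFormer-style token mixing on one channel given as an H x W list of
    lists: stride-1 average pooling over a pool_size x pool_size window,
    averaging only positions inside the map, then subtracting the input."""
    h, w = len(x), len(x[0])
    r = pool_size // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [x[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append(sum(window) / len(window) - x[i][j])
        out.append(row)
    return out

print(pool_token_mixer([[1, 2], [3, 4]]))
```

Because the mixer has no weights, it contributes no parameters, which is one reason PoolFormer-style networks can stay small.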
A 10-Gb/s/lane, Energy-Efficient Transceiver With Reference-Less Hybrid CDR for Mobile Display Link Interfaces
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-15 DOI: 10.1109/TVLSI.2024.3472073
Jonghyun Oh;Kwanseo Park;Young-Ha Hwang
Abstract: This brief presents an energy-efficient transceiver supporting a 10-Gb/s/lane display link interface between the application processor (AP) integrated circuit (IC) and the timing controller (TCON)-embedded source driver IC for mobile applications. An embedded clocking scheme is adopted to save clock distribution power, which also reduces the required number of off-chip I/O channels. The transmitter (TX) sends 20-Gb/s aggregate data through two differential data lanes, and the receiver (RX) recovers a 5-GHz half-rate clock. The TX employs a latch-less serializer using divided clocks in staggered phases, achieving an energy efficiency of 0.43 pJ/b/lane. In the RX, a hybrid clock and data recovery (CDR) circuit tracks the half-data rate with a digital loop filter (DLF) and subsequently locks frequency and phase with an analog loop filter (ALF). By deactivating the DLF and the edge deserializer once coarse frequency lock is acquired, the RX achieves an energy efficiency of 0.53 pJ/b/lane. The prototype transceiver, fabricated in a 28-nm CMOS technology, occupies an active area of 0.196 mm² and achieves an energy efficiency of 1.23 pJ/b/lane, including a charge-pump phase-locked loop (CP-PLL) with clock distribution.
Vol. 33, No. 3, pp. 887-891.
Citations: 0
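For a quick sense of scale, energy-per-bit figures convert to power by multiplying by the bit rate (1 pJ/b × 1 Gb/s = 1 mW). The back-of-the-envelope sketch below applies this to the numbers quoted in the abstract, assuming each pJ/b/lane figure holds at the full 10 Gb/s on both lanes; the brief's measured power breakdown may differ.

```python
def link_power_mw(energy_pj_per_bit, rate_gbps_per_lane, lanes=1):
    """Power (mW) implied by an energy-per-bit figure.
    1 pJ/b x 1 Gb/s = 1e-12 J/b x 1e9 b/s = 1e-3 W = 1 mW."""
    return energy_pj_per_bit * rate_gbps_per_lane * lanes

def half_rate_clock_ghz(rate_gbps):
    """A half-rate CDR samples on both clock edges, so f_clk = bit rate / 2."""
    return rate_gbps / 2

# Energy figures quoted in the abstract (pJ/b/lane), 2 lanes at 10 Gb/s each.
for label, energy in [("TX", 0.43), ("RX", 0.53), ("total incl. CP-PLL", 1.23)]:
    print(f"{label}: {link_power_mw(energy, 10, lanes=2):.1f} mW")
print(f"half-rate clock: {half_rate_clock_ghz(10):.0f} GHz")
```

The half-rate relation reproduces the abstract's 5-GHz recovered clock for 10 Gb/s per lane.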
Efficient ORBGRAND Implementation With Parallel Noise Sequence Generation
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-15 DOI: 10.1109/TVLSI.2024.3466474
Chao Ji;Xiaohu You;Chuan Zhang;Christoph Studer
Abstract: Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences at the expense of a slight frame error rate (FER) degradation. Instead of the logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity, hardware-friendly method called shift LWO (sLWO), whose shift factor can be chosen empirically to balance FER performance against query complexity. To generate noise sequences with sLWO efficiently, we use a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we evaluate synthesis results for polar codes in a 65-nm CMOS technology. At a target FER of 10^-7 and while maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gbps average throughput (1.26× higher), 4.2-Mbps worst-case throughput (8.24× higher), 2.4-Mbps/mm² worst-case area efficiency (12× higher), and 4.66 × 10^4 pJ/bit worst-case energy efficiency (9.96× lower) compared with the synthesized ORBGRAND design using LWO for a (128, 105) polar code; they also provide 8.62× higher average throughput and 9.4× higher average area efficiency, but 7.51× worse average energy efficiency, than the ORBGRAND chip for a (256, 240) polar code.
Vol. 33, No. 2, pp. 435-448.
Citations: 0
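The baseline that sLWO simplifies is the logistic weight order. In basic ORBGRAND, bits are ranked by reliability |LLR| (rank 1 = least reliable), a noise pattern's logistic weight is the sum of the ranks of the bits it flips, and patterns are queried in increasing weight, so the patterns of weight w are the partitions of w into distinct parts. The sketch below is a textbook software model of that query loop under stated assumptions (sign convention, parity checks given as index lists, max_weight as the abandonment limit); it does not reproduce the paper's sLWO shift factor or its LUT-aided hardware generator.

```python
def distinct_partitions(total, max_part):
    """Yield every set of distinct positive integers <= max_part that sums
    to total, as a tuple in descending order."""
    if total == 0:
        yield ()
        return
    for first in range(min(total, max_part), 0, -1):
        for rest in distinct_partitions(total - first, first - 1):
            yield (first,) + rest

def lwo_patterns(llrs, max_weight):
    """Yield noise patterns (sets of bit indices to flip) in logistic weight
    order: ranks come from sorting by |LLR| (1 = least reliable), and a
    pattern's logistic weight is the sum of its ranks."""
    order = sorted(range(len(llrs)), key=lambda i: abs(llrs[i]))
    yield frozenset()  # weight 0: the hard decision itself
    for weight in range(1, max_weight + 1):
        for ranks in distinct_partitions(weight, len(llrs)):
            yield frozenset(order[r - 1] for r in ranks)

def orbgrand_decode(llrs, parity_checks, max_weight):
    """Return the first word that satisfies every parity check, or None.
    parity_checks: one list of bit indices per row of the parity-check matrix."""
    hard = [1 if llr < 0 else 0 for llr in llrs]  # assumed: LLR > 0 favours bit 0
    for pattern in lwo_patterns(llrs, max_weight):
        word = [bit ^ (i in pattern) for i, bit in enumerate(hard)]
        if all(sum(word[i] for i in row) % 2 == 0 for row in parity_checks):
            return word
    return None

# Toy single-parity-check code of length 4; bit 1 is the least reliable.
print(orbgrand_decode([2.0, -0.3, 1.5, 3.0], [[0, 1, 2, 3]], max_weight=4))
```

The recursive partition generator is easy to read but inherently sequential, which hints at why hardware decoders look for orderings that are cheaper to generate in parallel.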
A Combo EMI Suppression Scheme for Multimode PSR Flyback Converter
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-14 DOI: 10.1109/TVLSI.2024.3470837
Yongyuan Li;Zhuliang Li;Wei Guo;Qiang Wu;Yongbo Zhang;Yong You;Zhangming Zhu
Abstract: Electromagnetic interference (EMI) is an inevitable issue in power electronics applications. Although many kinds of solutions have been presented to attenuate EMI noise, there is still little research on EMI suppression schemes for multimode primary-side regulation (PSR) flyback converters. Targeting EMI regulation in multimode PSR flyback converters, a combo EMI suppression scheme composed of frequency modulation and a dual-slope gate driver is adopted to meet stringent EMI requirements while simplifying the peripheral components and the design of the EMI filter. The proposed scheme is implemented in a 0.18-μm 5/40-V BCD process and occupies a die size (with pads) of 1.05 × 0.8 mm². The experimental results show that the conducted EMI waveforms for both line and neutral polarity easily comply with regulations. The output voltage deviation stays within ±1.3% under different inputs and loads, and a peak efficiency of 90% is achieved.
Vol. 33, No. 3, pp. 892-896.
Citations: 0
Research on Hardware Acceleration of Traffic Sign Recognition Based on Spiking Neural Network and FPGA Platform
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-14 DOI: 10.1109/TVLSI.2024.3470834
Huarun Chen;Yijun Liu;Wujian Ye;Jialiang Ye;Yuehai Chen;Shaozhen Chen;Chao Han
Abstract: Most existing methods for traffic sign recognition exploit deep learning techniques such as convolutional neural networks (CNNs) to achieve breakthroughs in detection accuracy; however, because CNNs have a large number of parameters, practical applications suffer from high power consumption, heavy computation, and slow speed. Compared with a CNN, a spiking neural network (SNN) can effectively simulate the information processing mechanism of the biological brain, with stronger parallel processing capability, better sparsity, and real-time performance. Thus, we design and realize a novel traffic sign recognition system [called SNN on FPGA-traffic sign recognition system (SFPGA-TSRS)] based on a spiking CNN (SCNN) and an FPGA platform. Specifically, to improve recognition accuracy, a traffic sign recognition model, spatial attention SCNN (SA-SCNN), is proposed by combining an SCNN based on LIF/IF neurons with an SA mechanism; to accelerate model inference, a high-performance neuron module is implemented, and an input coding module is designed as the input layer of the recognition model. Experiments on the GTSRB dataset show that, compared with existing systems, the proposed SFPGA-TSRS efficiently supports the deployment of SCNN models, with a higher recognition accuracy of 99.22%, a faster frame rate of 66.38 frames per second (FPS), and a lower power consumption of 1.423 W.
Vol. 33, No. 2, pp. 499-511.
Citations: 0
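The SA-SCNN is built from LIF/IF neurons. As a reference point only, the sketch below shows the common discrete-time leaky integrate-and-fire update with a hard reset; setting the leak factor beta to 1 gives a non-leaky IF neuron. The leak factor, threshold, and reset-to-zero rule are assumed textbook choices, since the abstract does not specify the paper's neuron module parameters.

```python
def lif_neuron(inputs, beta=0.9, threshold=1.0):
    """Discrete-time leaky integrate-and-fire neuron with hard reset.
    Each step: v = beta * v + input; emit a spike and reset v to 0 when
    v reaches the threshold. beta = 1.0 gives an integrate-and-fire (IF) neuron."""
    v = 0.0
    spikes = []
    for current in inputs:
        v = beta * v + current
        if v >= threshold:
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

print(lif_neuron([0.5] * 4, beta=1.0))  # IF neuron: fires every second step
print(lif_neuron([0.5] * 4, beta=0.5))  # leaky neuron never reaches threshold
```

The binary spike outputs are what make SNN hardware attractive: downstream synapses only need to accumulate weights for inputs that actually fired.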
3DNN-Xplorer: A Machine Learning Framework for Design Space Exploration of Heterogeneous 3-D DNN Accelerators
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-14 DOI: 10.1109/TVLSI.2024.3471496
Gauthaman Murali;Min Gyu Park;Sung Kyu Lim
Abstract: This article presents 3DNN-Xplorer, the first machine learning (ML)-based framework for predicting the performance of heterogeneous 3-D deep neural network (DNN) accelerators. Our ML framework facilitates the design space exploration (DSE) of heterogeneous 3-D accelerators with a two-tier compute-on-memory (CoM) configuration, considering 3-D physical design factors. Our design space encompasses four distinct heterogeneous 3-D integration styles, combining 28- and 16-nm technology nodes for both compute and memory tiers. Using extrapolation techniques with ML models trained on 10-to-256 processing element (PE) accelerator configurations, we estimate the performance of systems featuring 75 to 16384 PEs, achieving a maximum absolute error of 13.9% (the number of PEs is not continuous and varies based on the accelerator architecture). To ensure balanced tier areas in the design, our framework assumes the same number of PEs or the same on-chip memory capacity across the four integration styles, accounting for the area imbalance resulting from different technology nodes. Our analysis reveals that the heterogeneous 3-D style with 28-nm compute and 16-nm memory is energy-efficient and offers notable energy savings of up to 50% and an 8.8% reduction in runtime compared to other 3-D integration styles with the same number of PEs. Similarly, the heterogeneous 3-D style with 16-nm compute and 28-nm memory is area-efficient and shows up to 8.3% runtime reduction compared to other 3-D styles with the same on-chip memory capacity.
Vol. 33, No. 2, pp. 358-370.
Citations: 0
A Scalable and Efficient NTT/INTT Architecture Using Group-Based Pairwise Memory Access and Fast Interstage Reordering
IF 2.8, Zone 2 (Engineering and Technology)
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-10-09 DOI: 10.1109/TVLSI.2024.3465010
Zihang Wang;Yushu Yang;Jianfei Wang;Jia Hou;Yang Su;Chen Yang
Abstract: Polynomial multiplication is a significant bottleneck in mainstream postquantum cryptography (PQC) schemes. To speed it up, the number theoretic transform (NTT) is widely used, which reduces the time complexity from O(n²) to O(n log₂ n). However, it is challenging to ensure optimal hardware efficiency together with scalability. This brief proposes a novel pipelined NTT/inverse-NTT (INTT) architecture on a field-programmable gate array (FPGA). A group-based pairwise memory access (GPMA) scheme is proposed, and a scratchpad and reordering unit (SRU) is designed to form an efficient dataflow that simplifies the control units and achieves close to n/2 processing cycles on average for an n-point NTT. Moreover, our architecture supports varying parameters. Compared with state-of-the-art works, our architecture achieves up to 4.8× latency improvement and up to 4.3× improvement in area-time product (ATP).
Vol. 33, No. 2, pp. 588-592.
Citations: 0
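For readers less familiar with the NTT, the sketch below is a plain software reference, not the brief's hardware dataflow: an iterative radix-2 Cooley-Tukey NTT over the prime 998244353 (chosen here because 2^23 divides p − 1 and 3 is a primitive root), used for ordinary zero-padded polynomial multiplication. The modulus and the cyclic product are assumptions for illustration; PQC schemes typically use other moduli and negacyclic (mod xⁿ + 1) products, and the GPMA/SRU memory scheduling is not modeled. Each of the log₂ n stages performs n/2 butterflies, which is where the O(n log n) cost comes from.

```python
MOD = 998244353  # prime with 2^23 | MOD - 1 (illustrative choice)
ROOT = 3         # primitive root modulo MOD

def ntt(a, invert=False):
    """In-place iterative radix-2 NTT over Z_MOD; len(a) must be a power of two."""
    n = len(a)
    j = 0
    for i in range(1, n):  # bit-reversal permutation of the input order
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:  # log2(n) stages of n/2 butterflies each
        w_len = pow(ROOT, (MOD - 1) // length, MOD)  # primitive length-th root of unity
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % MOD
                a[start + k] = (u + v) % MOD
                a[start + k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:  # INTT also scales by n^-1
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD
    return a

def poly_mul(p, q):
    """Multiply coefficient lists (lowest degree first) via NTT, pointwise product, INTT."""
    out_len = len(p) + len(q) - 1
    size = 1
    while size < out_len:
        size <<= 1
    fp = ntt(p + [0] * (size - len(p)))
    fq = ntt(q + [0] * (size - len(q)))
    prod = ntt([x * y % MOD for x, y in zip(fp, fq)], invert=True)
    return prod[:out_len]

print(poly_mul([1, 2, 3], [4, 5]))  # (1 + 2x + 3x^2)(4 + 5x)
```

Zero-padding to a power of two at least len(p) + len(q) − 1 keeps the cyclic convolution from wrapping around, so the result equals the ordinary polynomial product as long as the true coefficients are below MOD.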