{"title":"Architectures for Serial and Parallel Pipelined NTT-Based Polynomial Modular Multiplication","authors":"Sin-Wei Chiu;Keshab K. Parhi","doi":"10.1109/TVLSI.2025.3576782","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576782","url":null,"abstract":"Quantum computers pose a significant threat to modern cryptographic systems by efficiently solving problems such as integer factorization through Shor’s algorithm. Homomorphic encryption (HE) schemes based on ring learning with errors (Ring-LWE) offer a quantum-resistant framework for secure computations on encrypted data. Many of these schemes rely on polynomial multiplication, which can be efficiently accelerated using the number theoretic transform (NTT) in leveled HE, ensuring practical performance for privacy-preserving applications. This article presents a novel NTT-based serial pipelined multiplier that achieves full-hardware utilization through interleaved folding, and overcomes the 50% under-utilization limitation of the conventional serial R2MDC architecture. In addition, it explores tradeoffs in pipelined parallel designs, including serial, 2-parallel, and 4-parallel architectures. Our designs leverage increased parallelism, efficient folding techniques, and optimizations for a selected constant modulus to achieve superior throughput (TP) compared with state-of-the-art implementations. While the serial fold design minimizes area consumption, the 4-parallel design maximizes TP. Experimental results on the Virtex-7 platform demonstrate that our architectures achieve at least 2.22 times higher TP/area for a polynomial length of 1024 and 1.84 times for a polynomial length of 4096 in the serial fold design, while the 4-parallel design achieves at least 2.78 times and 2.79 times, respectively. The efficiency gain is even more pronounced in TP squared over area, where the serial fold and 4-parallel designs outperform prior works by at least 4.98 times and 26.43 times for a polynomial length of 1024 and 6.7 times and 43.77 times for a polynomial length of 4096, respectively. These results highlight the effectiveness of our architectures in balancing performance, area efficiency, and flexibility, making them well-suited for high-speed cryptographic applications.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2474-2487"},"PeriodicalIF":3.1,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Soft Iterative Receiver With Simplified EP Detection for Coded MIMO Systems","authors":"Xiaosi Tan;Xiaohua Xie;Houren Ji;Tiancan Xia;Yongming Huang;Xiaohu You;Chuan Zhang","doi":"10.1109/TVLSI.2025.3536019","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3536019","url":null,"abstract":"Expectation propagation (EP) achieves excellent performance with high-order modulation in massive multiple-input multiple-output (MIMO) detection. The soft output of the EP detector can be iteratively combined with turbo soft decoders to enhance error-correction performance. However, the implementation of EP-based iterative detection and decoding (IDD) receivers suffer from an exponential increase in computational complexity as the number of antennas and modulation order grows. In this brief, we propose a simplified EP approximation-based IDD (sEPA-IDD) scheme for hardware implementation. To alleviate the computational burden, a simplified message update scheme is proposed, reducing complexity by 68% without performance degradation. Additionally, a unified design for extrinsic message computation further improves hardware utilization. Finally, we introduce the first unfolded EP-based IDD architecture to boost throughput. Compared with state-of-the-art (SOA) IDD receivers, the sEPA-IDD receiver implemented on 65 nm CMOS delivers a throughput of 3.07 Gb/s with a maximum 0.5 dB gain, achieving <inline-formula> <tex-math>$4.03times $ </tex-math></inline-formula> higher throughput and <inline-formula> <tex-math>$6.04times $ </tex-math></inline-formula> greater area efficiency.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1994-1998"},"PeriodicalIF":2.8,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RHT_NoC: A Reconfigurable Hybrid Topology Architecture for Chiplet-Based Multicore System","authors":"Dongyu Xu;Wu Zhou;Zhengfeng Huang;Huaguo Liang;Xiaoqing Wen","doi":"10.1109/TVLSI.2025.3572112","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3572112","url":null,"abstract":"Chiplet-based system-on-chip (SoC) architectures, leveraging 2.5-D/3-D integration technologies, provide scalable solutions for a wide range of applications. Achieving high performance and cost-effectiveness in these systems relies heavily on optimizing die-to-die interconnect topologies and designs, which are essential for seamless interchiplet communication. This article introduces a reconfigurable hybrid topology (RHT) architecture designed for chiplet-based multicore systems. RHT achieves high performance and energy efficiency by dynamically reconfiguring the network topology to traffic variations, adaptively selecting transport subnets, and optimizing link bandwidth allocation, thereby minimizing congestion and maximizing packet throughput. Furthermore, RHT leverages global traffic information to dynamically combine Torus loops, maximizing opportunities for rapid packet transmission delivery while guaranteeing minimal hop counts. Moreover, RHT accelerates packet transmission via bufferless combined loops, extending the continuous sleeping periods of routers, improves power gating efficiency, and significantly reduces static power consumption. Simulation results indicate that the Mesh-DyRing achieves over a 40% reduction in network latency and more than a 20% decrease in power consumption overhead compared to the baseline design. When compared to WiNoC, an advanced hybrid wired-wireless topology design, the Mesh-DyRing-PG configuration reduces power consumption by 56.2% while maintaining equivalent average network latency.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2104-2117"},"PeriodicalIF":2.8,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable FPGA Architecture With Adaptive Memory Utilization for GEMM-Based Operations","authors":"Anastasios Petropoulos;Theodore Antonakopoulos","doi":"10.1109/TVLSI.2025.3571677","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3571677","url":null,"abstract":"Deep neural network (DNN) inference relies increasingly on specialized hardware for high computational efficiency. This work introduces a field-programmable gate array (FPGA)-based dynamically configurable accelerator featuring systolic arrays (SAs), high-bandwidth memory (HBM), and UltraRAMs. We present two processing unit (PU) configurations with different computing capabilities using the same interfaces and peripheral blocks. By instantiating multiple PUs and employing a heuristic weight transfer schedule, the architecture achieves notable throughput efficiency over prior works. Moreover, we outline how the architecture can be extended to emulate analog in-memory computing (AIMC) devices to aid next-generation heterogeneous AIMC chip designs and investigate device-level noise behavior. Overall, this brief presents a versatile DNN inference acceleration architecture adaptable to various models and future FPGA designs.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2334-2338"},"PeriodicalIF":2.8,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of a Low-Power Analog Integrated Deep Convolutional Neural Network","authors":"Zisis Foufas;Vassilis Alimisis;Paul P. Sotiriadis","doi":"10.1109/TVLSI.2025.3573045","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3573045","url":null,"abstract":"In this article, a framework for the analog implementation of a deep convolutional neural network (CNN) is introduced and used to derive a new circuit architecture which is composed of an improved analog multiplier and circuit blocks implementing the ReLU activation function and the argmax operator. The operating principles of the individual blocks, as well as those of the complete architecture, are analyzed and used to realize a low-power analog classifier, consuming less than <inline-formula> <tex-math>$1.8~mu text {W}$ </tex-math></inline-formula>. The proper operation of the classifier is verified via a comparison with a software equivalent implementation and its performance is evaluated against existing circuit architectures. The proposed architecture is implemented in a TSMC 90-nm CMOS process and simulated using Cadence IC Suite for both schematic and layout design. Corner and Monte Carlo mismatch simulations of the schematic and the physical circuit (postlayout) were conducted to evaluate the effect of transistor mismatches and process voltage temperature (PVT) variations and to showcase a proposed systematic method for offsetting their effect.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2172-2185"},"PeriodicalIF":2.8,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Speed Compute-Efficient Bandit Learning for Many Arms","authors":"Ishaan Sharma;Sumit J. Darak;Rohit Kumar","doi":"10.1109/TVLSI.2025.3573924","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3573924","url":null,"abstract":"Multiarmed bandits (MABs) are online machine learning algorithms that aim to identify the optimal arm without prior statistical knowledge via the exploration-exploitation tradeoff. The performance metric, regret, and computational complexity of the MAB algorithms degrade with the increase in the number of arms, <italic>K</i>. In applications such as wireless communication, radar systems, and sensor networks, <italic>K</i>, i.e., the number of antennas, beams, bands, etc., is expected to be large. In this work, we consider focused exploration-based MAB, which outperforms conventional MAB for large <italic>K</i>, and its mapping on various edge processors and multiprocessor system on a chip (MPSoC) via hardware-software co-design (HSCD) and fixed point (FP) analysis. The proposed architecture offers 67% reduction in average cumulative regret, 84% reduction in execution time on edge processor, 97% reduction in execution time using FPGA-based accelerator, and 10% savings in resources over state-of-the-art MABs for large <inline-formula> <tex-math>$K=100$ </tex-math></inline-formula>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"2099-2103"},"PeriodicalIF":2.8,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Isa Altoobaji;Ahmad Hassan;Mohamed Ali;Yves Audet;Ahmed Lakhssassi
{"title":"A Compact High-Speed Capacitive Data Transfer Link With Common Mode Transient Rejection for Isolated Sensor Interfaces","authors":"Isa Altoobaji;Ahmad Hassan;Mohamed Ali;Yves Audet;Ahmed Lakhssassi","doi":"10.1109/TVLSI.2025.3573226","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3573226","url":null,"abstract":"In this article, a compact differential data transfer link architecture for isolated sensor interfaces (SIs) and immune to common mode transients (CMTs) is presented. The proposed architecture shows low latency supporting high-speed transmission with a low bit error rate (BER) in the presence of CMT noise for applications, such as data acquisition, biomedical equipment, and communication networks. In transportation applications, motors and actuators are subjected to harsh environmental conditions, e.g., lightning strikes and abnormal voltage operations. These conditions introduce noise and can cause damage to small electronics due to high-voltage power surges. To ensure human safety and circuitry protection, a data transfer system must be implemented between high-voltage and low-voltage domains. The proposed design has been simulated using Cadence tools, and a prototype has been manufactured in a 0.18-<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>m CMOS process. The fabricated prototype consumes an effective silicon area of <inline-formula> <tex-math>$37.2times 10^{3}~mu $ </tex-math></inline-formula>m<sup>2</sup> and can sustain a breakdown voltage of 710 V<sub>rms</sub>. Experimental results show that the proposed solution achieves a CMT immunity (CMTI) of 2.5 kV/<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>s at a data rate of 480 Mb/s with a BER of <inline-formula> <tex-math>$10^{-12}$ </tex-math></inline-formula>. The propagation delay is 3.9 ns with a 4 ps/°C variation rate over temperatures ranging from <inline-formula> <tex-math>$- 31~^{circ }$ </tex-math></inline-formula>C to <inline-formula> <tex-math>$100~^{circ }$ </tex-math></inline-formula>C. Under typical test conditions, the BER reaches <inline-formula> <tex-math>$10^{-15}$ </tex-math></inline-formula> with a peak-to-peak data dependent jitter (DDJ) of 29.8 ps.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2163-2171"},"PeriodicalIF":2.8,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiaxin Qing;Philip H. W. Leong;Kin-Hong Lee;Raymond W. Yeung
{"title":"Toward High-Performance Network Coding: FPGA Acceleration With Bounded-Value Generators","authors":"Jiaxin Qing;Philip H. W. Leong;Kin-Hong Lee;Raymond W. Yeung","doi":"10.1109/TVLSI.2025.3572517","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3572517","url":null,"abstract":"The network coding enhances performance in network communications and distributed storage by increasing throughput and robustness while reducing latency. Batched sparse (BATS) codes are a class of capacity-achieving network codes, but their practical applications are hindered by their structure, computational intensity, and power demands of finite field (FF) operations. Most literature focuses on algorithmic-level techniques to improve the coding efficiency. Optimization with an algorithm/hardware co-designing approach has long been neglected. Leveraging the unique structure of BATS codes, we first present cyclic-shift BATS (CS-BATS), a hardware-friendly variant. Next, we propose a simple but effective bounded-value (BV) generator, to reduce the size of a finite field multiplier by up to 70%. Finally, we report on a scalable and resource-efficient field-programmable gate array (FPGA)-based network coding accelerator that achieves a throughput of 27 Gb/s, a speedup of more than 300 over software.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2274-2287"},"PeriodicalIF":2.8,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Compact Power-on-Reset Circuit With Configurable Brown-Out Detection","authors":"Yoochang Kim;Jun-Eun Park;Kwanseo Park;Young-Ha Hwang","doi":"10.1109/TVLSI.2025.3561131","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3561131","url":null,"abstract":"A compact power-on-reset (POR) circuit with a configurable brown-out reset (BOR) function is presented. An integrated voltage reference (VR) circuit provides a constant bias voltage that facilitates voltage-triggered POR/BOR operation, reliably preventing POR signal generation when the ramping supply voltage (<inline-formula> <tex-math>${V} _{text {DD}}$ </tex-math></inline-formula>) level is too low. Moreover, the proposed POR circuit features a fast, configurable POR/BOR operation owing to an inverter-based trip point detector (TPD), which triggers the reset signal with a programmable trip point. The prototype POR circuit achieves a POR level higher than 752 mV with a maximum POR delay of <inline-formula> <tex-math>$16.4~mu $ </tex-math></inline-formula>s at a 0.8–1.2-V <inline-formula> <tex-math>${V} _{text {DD}}$ </tex-math></inline-formula>, supporting a wide range of supply ramping time from <inline-formula> <tex-math>$1~mu $ </tex-math></inline-formula>s to 1 s. In addition, the prototype detects brown-out events with a supply drop of 0.1–0.4 V, generating the BOR signal. Designed using a 28-nm CMOS process, the prototype has a compact active area of <inline-formula> <tex-math>$995.3~mu $ </tex-math></inline-formula>m<sup>2</sup> and a quiescent current of 162–974 nA at a 1-V <inline-formula> <tex-math>${V} _{text {DD}}$ </tex-math></inline-formula>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"2074-2078"},"PeriodicalIF":2.8,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Partial Recomputation-Based Fault Detection Approaches for Z-transform","authors":"Saeed Aghapour;Kasra Ahmadi;Mehran Mozaffari Kermani;Reza Azarderakhsh","doi":"10.1109/TVLSI.2025.3560154","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3560154","url":null,"abstract":"The Z-transform is a fundamental and strong tool being widely utilized in signal processing and various other applications such as communications and networking. By analyzing the Z-transform of a signal, one can extract critical information about its stability, causality, frequency response, energy and power, and overall behavior of the signal. However, errors caused either by environmental changes or malicious injections in large-scale integration (VLSI) implementations can critically compromise the integrity and reliability of its output. Failure to detect such faults may result in unpredictable, erroneous, and misleading function analyses. Therefore, the ability to detect soft errors and faults before accepting the results is of paramount importance. In this article, we propose an efficient fault detection method that combines algorithmic-level checks with partial recomputation to identify both transient and permanent faults with a high error coverage rate across various injection scenarios. The AMD/Xilinx field-programmable gate array (FPGA) implementation of our design demonstrated only a modest increase in time and area overhead. To the best of our knowledge, fault detection for the Z-transform function has not been previously studied.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1983-1993"},"PeriodicalIF":2.8,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144581497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}