{"title":"Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration","authors":"Huidong Ji;Chen Ding;Boming Huang;Yuxiang Huan;Li-Rong Zheng;Zhuo Zou","doi":"10.1109/JETCAS.2024.3437408","DOIUrl":"10.1109/JETCAS.2024.3437408","url":null,"abstract":"The explosive development of convolutional neural networks (CNNs) benefits greatly from hardware-based acceleration to maintain low latency and high resource utilization. To enhance the processing efficiency of CNN algorithms, Field-Programmable Gate Array (FPGA)-based accelerators are designed with increased hardware resources to achieve high parallelism and throughput. However, bottlenecks arise when more processing elements (PEs) in the form of PE clusters are introduced, including 1) the under-utilization of the FPGA’s fixed hardware resources, which leads to a mismatch between effective and peak performance; and 2) the limited clock frequency caused by the sophisticated routing and complex placement. In this paper, a 2-level hierarchical Network-on-Chip (NoC)-based CNN accelerator is proposed. At the upper level, a mesh-based NoC that interconnects multiple PE clusters is introduced. Such a design not only provides increased flexibility to balance different data communication models for better PE utilization and energy efficiency but also enables a globally asynchronous, locally synchronous (GALS) architecture for better timing closure. At the lower level, local PEs are organized into a 3D-tiled PE cluster that aims to maximize data reuse by exploiting the inherent dataflow of convolutional networks. Implementation and experiments on a Xilinx ZU9EG FPGA for 4 benchmark CNN models: ResNet50, ResNet34, VGG16, and Darknet19 show that our work operates at a frequency of 300 MHz and delivers an effective throughput of 0.998 TOPS, 1.022 TOPS, 1.024 TOPS, and 1.026 TOPS. This result corresponds to 92.85%, 95.1%, 95.25%, and 95.46% PE utilization. 
Compared with the related FPGA-based designs, our work improves the resource efficiency of DSP by $5.36\\times$, $1.62\\times$, $1.96\\times$, and $5.83\\times$, respectively.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"440-454"},"PeriodicalIF":3.7,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141883463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcement Learning (RL)-Based Holistic Routing and Wavelength Assignment in Optical Network-on-Chip (ONoC): Distributed or Centralized?","authors":"Hui Li;Jiahe Zhao;Feiyang Liu","doi":"10.1109/JETCAS.2024.3435721","DOIUrl":"10.1109/JETCAS.2024.3435721","url":null,"abstract":"With the development of silicon photonic interconnects, Optical Network-on-Chip (ONoC) becomes promising for multi-core/many-core communication. In ONoCs, both routing and wavelength assignment have an impact on the communication reliability and performance. However, the interactive impact of the routing and wavelength assignment is rarely considered. To fill this gap, this work proposes an adaptive and holistic method of routing and wavelength assignment (RWA) based on Reinforcement Learning (RL) for ONoCs. Routing and wavelength assignment is treated as a whole problem and participates in the same Markov decision process. Two corresponding implementation methods, i.e., distributed and centralized, are proposed, using intelligent learning algorithms to process and learn the dynamic on-chip network information in multiple dimensions. Compared with handling routing and wavelength assignment in separate steps, the evaluation results show that the proposed holistic method improves OSNR, waiting delay, and wavelength utilization by 2.58 dB, 9.21%, and 53.26%, respectively, at the cost of a 16.15% loss in load balancing. 
As for the distributed method and centralized method, the distributed method improves by 0.37 dB and 0.69% in the aspects of OSNR and waiting delay, but the centralized method improves by 13.84% and 4.46% in the aspects of load balancing and wavelength utilization.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"534-550"},"PeriodicalIF":3.7,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Ultra-Low Cost and Multicast-Enabled Asynchronous NoC for Neuromorphic Edge Computing","authors":"Zhe Su;Simone Ramini;Demetra Coffen Marcolin;Alessandro Veronesi;Milos Krstic;Giacomo Indiveri;Davide Bertozzi;Steven M. Nowick","doi":"10.1109/JETCAS.2024.3433427","DOIUrl":"10.1109/JETCAS.2024.3433427","url":null,"abstract":"Biological brains are increasingly taken as a guide toward more efficient forms of computing. The latest frontier considers the use of spiking neural-network-based neuromorphic processors for near-sensor data processing, in order to fit the tight power and resource budgets of edge computing devices. However, a prevailing focus on brain-inspired computing and storage primitives in the design of neuromorphic systems is currently bringing a fundamental bottleneck to the forefront: chip-scale communications. While communication architectures (typically, a network-on-chip) are generally inspired by, or even borrowed from, general purpose computing, neuromorphic communications exhibit unique characteristics: they consist of the event-driven routing of small amounts of information to a large number of destinations within tight area and power budgets. This article aims at an inflection point in network-on-chip design for brain-inspired communications, revolving around the combination of cost-effective and robust asynchronous design, architecture specialization for short messaging and lightweight hardware support for tree-based multicast. 
When validated with functional spiking neural network traffic, the proposed NoC delivers energy savings ranging from 42% to 71% over a state-of-the-art NoC used in a real multi-core neuromorphic processor for edge computing applications.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"409-424"},"PeriodicalIF":3.7,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10609786","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141775221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Analysis of 3D Integrated Folded Ferro-Capacitive Crossbar Array (FC²A) for Brain-Inspired Computing System","authors":"Sherin A. Thomas;Suyash Kushwaha;Rohit Sharma;Devarshi Mrinal Das","doi":"10.1109/JETCAS.2024.3432458","DOIUrl":"10.1109/JETCAS.2024.3432458","url":null,"abstract":"This paper presents a novel 3D folded capacitive synaptic crossbar array designed for in-memory computing architectures. In this architecture, the bitline is folded over the wordline to enhance the synaptic density. The proposed folded capacitive crossbar array ($FC^{2}A$) architecture decreases the wordline interconnect length and physical crossbar area by 50%. Thus, it helps to reduce the crossbar-associated parasitics and optimize space utilization. The proposed folded capacitive synaptic crossbar is used for designing a brain-inspired computing system (BiCoS) to recognize different patterns using CMOS technology. The BiCoS systems are prone to various reliability issues caused by the crossbar’s parasitics. Hence, the 3D folded capacitive crossbar’s Q3D model is developed to investigate the crossbar-associated parasitics and its effect on the proposed system is analyzed. The impact of crossbar parasitics is investigated for two cases: Firstly, how the three different spiking patterns (regular spiking, fast-spiking, and chattering) of the Izhikevich neuron change for the different crossbar sizes. Secondly, the impact is analyzed on the pattern recognition rate, which gets reduced to 70%. Addressing these challenges is critical to ensure the correct and robust working of the proposed system. Therefore, we propose a solution to effectively overcome and resolve these adverse effects. 
The energy consumed to recognize each pattern is calculated, and the average energy needed is $0.25\\,nJ$, which is significantly less when compared to the other state-of-the-art works. The circuit is implemented using 65nm standard CMOS technology.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"563-574"},"PeriodicalIF":3.7,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141775222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SwInt: A Non-Blocking Switch-Based Silicon Photonic Interposer Network for 2.5D Machine Learning Accelerators","authors":"Ebadollah Taheri;Mohammad Amin Mahdian;Sudeep Pasricha;Mahdi Nikdast","doi":"10.1109/JETCAS.2024.3429354","DOIUrl":"10.1109/JETCAS.2024.3429354","url":null,"abstract":"The surging demand for machine learning (ML) applications has emphasized the pressing need for efficient ML accelerators capable of addressing the computational and energy demands of increasingly complex ML models. However, the conventional monolithic design of large-scale ML accelerators on a single chip often entails prohibitively high fabrication costs. To address this challenge, this paper proposes a 2.5D chiplet-based architecture based on a silicon photonic interposer, called SwInt, to enable high bandwidth, low latency, and energy-efficient data movement on the interposer, for ML applications. Existing silicon photonic interposer implementations suffer from high power consumption attributed to their inefficient network designs, primarily relying on bus-based communication. Bus-based communication is not scalable, as it suffers from high power consumption of the optical laser due to cumulative losses on the readers and writers when the bandwidth per waveguide (i.e., wavelength division multiplexing degree) increases or the number of processing elements in ML accelerators scales up. SwInt incorporates a novel switch-based network designed using Mach-Zehnder Interferometer (MZI)-based switch cells for offering scalable interposer communication and reducing power consumption. The designed switch architecture avoids blocking using an efficient design, while minimizing the number of stages to offer a low-loss switch. Furthermore, the MZI switch cells are designed with a dividing state, enabling energy-efficient broadcast communication over the interposer and supporting broadcasting demand in ML accelerators. 
Additionally, we optimized and fabricated silicon photonic devices, Microring Resonators (MRRs) and MZIs, which are integral components of our network architecture. Our analysis shows that SwInt achieves, on average, 62% and 64% improvement in power consumption under, respectively, unicast and broadcast communication, resulting in 59.7% energy-efficiency improvement compared to the state-of-the-art silicon photonic interposers specifically designed for ML accelerators.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"520-533"},"PeriodicalIF":3.7,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10599539","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141720510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenPiton4HPC: Optimizing OpenPiton Toward High-Performance Manycores","authors":"Neiel Leyva;Alireza Monemi;Noelia Oliete-Escuín;Guillem López-Paradís;Xabier Abancens;Jonathan Balkind;Enrique Vallejo;Miquel Moretó;Lluc Alvarez","doi":"10.1109/JETCAS.2024.3428929","DOIUrl":"10.1109/JETCAS.2024.3428929","url":null,"abstract":"In recent years, numerous multicore RISC-V platforms have emerged. Development frameworks such as OpenPiton are employed in designs that aim to scale to a large number of cores. While OpenPiton presents a large flexibility, supporting different requirements and processing cores, some of its design decisions result in designs that are not optimized for High-Performance Computing (HPC) requirements. This work presents OpenPiton4HPC, an extension and optimization of OpenPiton for high-performance manycores. The key contributions are enabling multiple memory controllers, supporting router bypassing and NoC concentration, adding support for configurable cache sizes and cache block sizes, and allowing configurable bus widths in the NoC and in the cache SRAMs. On a 64-core manycore architecture, these new features and optimizations provide a geometric mean speedup of 7.2x compared to the OpenPiton baseline.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"395-408"},"PeriodicalIF":3.7,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141720511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantum Cryptanalysis of Affine Cipher","authors":"Mahima Mary Mathews;Panchami V;Vishnu Ajith","doi":"10.1109/JETCAS.2024.3428436","DOIUrl":"10.1109/JETCAS.2024.3428436","url":null,"abstract":"Quantum Algorithms reduce the computational complexity or solve certain difficult problems that were originally impossible to solve with classical computers. Grover’s search algorithm is a Quantum computation algorithm that can find target elements from a set of unstructured data with the best possible $O(\\sqrt{N})$ queries. Grover’s search Quantum circuits implemented accurately can be used to successfully search and find the keys of Symmetric ciphers. However, very few demonstrations of such practical cryptanalysis are available. In this paper, practical Quantum cryptanalysis circuits for Affine Cipher are proposed and demonstrated that successfully break the cipher by finding the keys.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"507-519"},"PeriodicalIF":3.7,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141720516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HALO: Communication-Aware Heterogeneous 2.5-D System for Energy-Efficient LLM Execution at Edge","authors":"Abhi Jaiswal;K. C. Sharin Shahana;Sujitha Ravichandran;K. Adarsh;H. Bharath Bhat;Biresh Kumar Joardar;Sumit K. Mandal","doi":"10.1109/JETCAS.2024.3427421","DOIUrl":"10.1109/JETCAS.2024.3427421","url":null,"abstract":"Large Language Models (LLMs) are used to perform various tasks, especially in the domain of natural language processing (NLP). State-of-the-art LLMs consist of a large number of parameters that necessitate a high volume of computations. Currently, GPUs are the preferred choice of hardware platform to execute LLM inference. However, monolithic GPU-based systems executing large LLMs pose significant drawbacks in terms of fabrication cost and energy efficiency. In this work, we propose a heterogeneous 2.5D chiplet-based architecture for accelerating LLM inference. The proposed 2.5D system consists of heterogeneous chiplets connected via a network-on-package (NoP). In the proposed 2.5D system, we leverage the energy efficiency of in-memory computing (IMC) and the general-purpose computing capability of CMOS-based floating point units (FPUs). The 2.5D technology helps to integrate two different technologies (IMC and CMOS) on the same system. Due to a large number of parameters, communication between chiplets becomes a significant performance bottleneck if not optimized while executing LLMs. To this end, we propose a communication-aware scalable technique to map different pieces of computations of an LLM onto different chiplets. The proposed mapping technique minimizes the communication energy and latency over the NoP, and is significantly faster than existing optimization techniques. 
Thorough experimental evaluations with a wide variety of LLMs show that the proposed 2.5D system provides up to $972\\times$ improvement in latency and $1600\\times$ improvement in energy consumption with respect to state-of-the-art edge devices equipped with GPU.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"425-439"},"PeriodicalIF":3.7,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141610590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Fidelity-Oriented Entanglement Distribution for Quantum Switches","authors":"Ziyue Jia;Lin Chen","doi":"10.1109/JETCAS.2024.3425712","DOIUrl":"10.1109/JETCAS.2024.3425712","url":null,"abstract":"We consider a star-shaped quantum network with a quantum switch in the center serving a number of requests, each characterized by two non-classical QoS requirements, the end-to-end entanglement delivery rate and the fidelity of the delivered entanglements. The central task of the switch is to allocate the limited entanglement resources among requests to maximize the system performance. We formulate the fundamental entanglement distribution problem where the switch decides 1) which requests to admit, and 2) as multiple requests may share the same quantum link, how to distribute the limited link-level entanglement resources among those competing requests. We then design a framework of joint entanglement purification scheduling and distribution for quantum switches. Our entanglement purification scheduling algorithm seeks to use minimal link-level entanglement resources to satisfy the QoS requirement of a single request. Our entanglement distribution algorithm further allocates the limited entanglement resources among multiple requests to maximize the overall utility by integrating the designed entanglement purification scheduling algorithm. 
We establish theoretical performance guarantee of our proposition, which is complemented by extensive numerical experiments demonstrating its effectiveness in a variety of network settings.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"495-506"},"PeriodicalIF":3.7,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141567373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delay-Constrained GNR Routing With CNT-Via Insertion in Nano-Scale Designs","authors":"Jin-Tai Yan","doi":"10.1109/JETCAS.2024.3424217","DOIUrl":"10.1109/JETCAS.2024.3424217","url":null,"abstract":"It is well known that graphene nanoribbon (GNR) can be used as interconnects in nano-scale designs. In this paper, given a set of delay-constrained GNR nets in a multiple-layer routing plane, based on the construction of a combined carbon nanotube (CNT)/graphene hetero-structure for CNT-vias between two adjacent layers, an efficient routing algorithm can be proposed to minimize the number of the used layers with satisfying the non-crossing constraints between two GNR nets and the delay constraints on the GNR nets in GNR routing with CNT-via insertion. In the initial assignment, based on the definition of the delay-constrained routing pattern on a GNR net with tight delay constraint and the delay-constrained via path on a GNR net, the delay-constrained routing patterns can be firstly assigned for layer minimization and the delay-driven minimum-length routing paths and the delay-constrained via paths can be further assigned onto the available layers. In the iterative routing, the unrouted GNR nets can be further routed on the available layers and some possible new layers by using one iterative maze-routing and rip-up-and-rerouting process. 
Compared with the published routing algorithms with no via insertion, the experimental results show that our proposed routing algorithm with CNT-via insertion can insert some CNT-vias and use shorter wirelength to decrease 53.8% and 24.9% of the number of the used layer under reasonable CPU time on the given GNR nets with two different sets of the delay constraints for 8 tested examples on the average, respectively.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"371-383"},"PeriodicalIF":3.7,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141567377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}