{"title":"EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture","authors":"Peiyan Dong;Jinming Zhuang;Zhuoping Yang;Shixin Ji;Yanyu Li;Dongkuan Xu;Heng Huang;Jingtong Hu;Alex K. Jones;Yiyu Shi;Yanzhi Wang;Peipei Zhou","doi":"10.1109/TCAD.2024.3443692","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443692","url":null,"abstract":"While vision transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (<1 […] 13.1× over computing solutions of Intel Xeon 8375C vCPU, Nvidia A10G, A100, Jetson AGX Orin GPUs, AMD ZCU102, and U250 FPGAs. The energy efficiency gains are 62.2×, 15.33×, 12.82×, 13.31×, 13.5×, and 21.9×, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3949-3960"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NOBtree: A NUMA-Optimized Tree Index for Nonvolatile Memory","authors":"Zhaole Chu;Peiquan Jin;Yongping Luo;Xiaoliang Wang;Shouhong Wan","doi":"10.1109/TCAD.2024.3438111","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3438111","url":null,"abstract":"Nonvolatile memory (NVM) suffers from more serious nonuniform memory access (NUMA) effects than DRAM because of its lower bandwidth and higher latency. While numerous works have aimed at optimizing NVM indexes, only a few have tried to address the NUMA impact. Existing approaches mainly rely on local NVM write buffers or DRAM-based read buffers to mitigate the cost of remote NVM access, which introduces memory overhead and degrades lookup and scan performance. In this article, we present NOBtree, a new NUMA-optimized persistent tree index. The novelty of NOBtree is twofold. First, NOBtree presents per-NUMA replication and an efficient node-migration mechanism to reduce remote NVM access. Second, NOBtree proposes a NUMA-aware NVM allocator to improve insert performance and scalability. We conducted experiments on six workloads to evaluate the performance of NOBtree. The results show that NOBtree can effectively reduce the number of remote NVM accesses. Moreover, NOBtree outperforms existing persistent indexes, including TLBtree, Fast&Fair, ROART, and PACtree, with up to 3.23× higher throughput and 4.07× lower latency.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3840-3851"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arch2End: Two-Stage Unified System-Level Modeling for Heterogeneous Intelligent Devices","authors":"Weihong Liu;Zongwei Zhu;Boyu Li;Yi Xiong;Zirui Lian;Jiawei Geng;Xuehai Zhou","doi":"10.1109/TCAD.2024.3443706","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443706","url":null,"abstract":"The surge in intelligent edge computing has propelled the adoption and expansion of distributed embedded systems (DESs). Numerous scheduling strategies have been introduced to improve DES throughput, such as latency-aware and group-based hierarchical scheduling. Effective device modeling can help in modular and plug-in scheduler design. For uniformity in scheduling interfaces, a unified device performance model is adopted, typically involving system-level modeling that incorporates both the hardware and software stacks, broadly divided into two categories. Fine-grained modeling methods based on hardware architecture analysis become very difficult when dealing with a large number of heterogeneous devices, mainly because much architecture information is closed-source and costly to analyze. Coarse-grained methods are based on limited architecture information or benchmark models, resulting in insufficient generalization across the complex inference performance of diverse deep neural networks (DNNs). Therefore, we introduce a two-stage system-level modeling method (Arch2End), combining limited architecture information with scalable benchmark models to achieve a unified performance representation. Stage one leverages public information to analyze architectures in a uniform abstraction and to design benchmark models for exploring the device performance boundaries, ensuring uniformity. Stage two extracts critical device features from the end-to-end inference metrics of extensive simulation models, ensuring universality and enhancing characterization capacity. Compared to state-of-the-art methods, Arch2End achieves the lowest DNN latency prediction relative errors on NAS-Bench-201 (1.7%) and real-world DNNs (8.2%). It also showcases superior performance in intergroup-balanced device grouping strategies.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4154-4165"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ARTEMIS: A Mixed Analog-Stochastic In-DRAM Accelerator for Transformer Neural Networks","authors":"Salma Afifi;Ishan Thakkar;Sudeep Pasricha","doi":"10.1109/TCAD.2024.3446719","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3446719","url":null,"abstract":"Transformers have emerged as a powerful tool for natural language processing (NLP) and computer vision. Through the attention mechanism, these models have exhibited remarkable performance gains compared to conventional approaches like recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Nevertheless, transformers typically demand substantial execution time due to their extensive computations and large memory footprint. Processing in-memory (PIM) and near-memory computing (NMC) are promising solutions for accelerating transformers, as they offer high compute parallelism and memory bandwidth. However, designing PIM/NMC architectures to support the complex operations and the massive amounts of data that must be moved between layers in transformer neural networks remains a challenge. We propose ARTEMIS, a mixed analog-stochastic in-DRAM accelerator for transformer models. By employing minimal changes to conventional DRAM arrays, ARTEMIS efficiently alleviates the costs associated with transformer model execution by supporting stochastic computing for multiplications and temporal analog accumulation using a novel in-DRAM metal-on-metal capacitor. Our analysis indicates that ARTEMIS exhibits at least 3.0× speedup and 1.8× lower energy compared to GPU, TPU, CPU, and state-of-the-art PIM transformer hardware accelerators.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3336-3347"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
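The stochastic-computing multiplication that ARTEMIS maps onto DRAM can be illustrated in plain software: a value in [0, 1] is encoded as a random bitstream whose density of 1s equals the value, and multiplication of two values reduces to a bitwise AND of two independent streams. The sketch below is a minimal software analogy of that general principle, not the paper's in-DRAM implementation; all function names are illustrative.

```python
import random

def to_bitstream(value, length, rng):
    """Encode a value in [0, 1] as a unary stochastic bitstream:
    each bit is independently 1 with probability `value`."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def sc_multiply(a, b, length=10_000, seed=0):
    """Approximate a * b by ANDing two independent stochastic
    bitstreams and counting the surviving ones."""
    rng = random.Random(seed)
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length

product = sc_multiply(0.8, 0.5)
# The estimate converges to 0.4 as the stream length grows.
```

The accuracy/latency tradeoff is explicit here: longer streams shrink the variance of the estimate, which is why stochastic-computing hardware trades cheap logic for long bitstreams.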
{"title":"Latent RAGE: Randomness Assessment Using Generative Entropy Models","authors":"Kuheli Pratihar;Rajat Subhra Chakraborty;Debdeep Mukhopadhyay","doi":"10.1109/TCAD.2024.3449562","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3449562","url":null,"abstract":"NIST’s recent review of the widely employed Special Publication (SP) 800-22 randomness testing suite has underscored several shortcomings, particularly the absence of entropy-source modeling and the necessity for large sequence lengths. Motivated by this revelation, we explore low-dimensional modeling of the entropy source in random number generators (RNGs) using a variational autoencoder (VAE). This low-dimensional modeling enables the separation of strong and weak entropy sources by magnifying the deterministic effects in the latter, which are otherwise difficult to detect with conventional testing. Bits from weak-entropy RNGs with bias, correlation, or deterministic patterns are more likely to lie on a low-dimensional manifold within a high-dimensional space, in contrast to strong-entropy RNGs, such as true RNGs (TRNGs) and pseudo-RNGs (PRNGs) with uniformly distributed bits. We exploit this insight to employ a generative AI-based noninterference test (GeNI) for the first time, achieving implementation-agnostic low-dimensional modeling of all types of entropy sources. GeNI’s generative aspect uses VAEs to produce synthetic bitstreams from the latent representation of RNGs, which are subjected to a deep learning (DL)-based noninterference (NI) test evaluating the masking ability of the synthetic bitstreams. The core principle of the NI test is that if the bitstream exhibits high-quality randomness, the masked data from the two sources should be indistinguishable. GeNI facilitates a comparative analysis of low-dimensional entropy-source representations across various RNGs, adeptly identifying the artificial randomness in specious RNGs with deterministic patterns that otherwise pass all NIST SP 800-22 tests. Notably, GeNI achieves this with 10× shorter sequence lengths and 16.5× faster execution time compared to the NIST test suite.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3503-3514"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROI-HIT: Region of Interest-Driven High-Dimensional Microarchitecture Design Space Exploration","authors":"Xuyang Zhao;Tianning Gao;Aidong Zhao;Zhaori Bi;Changhao Yan;Fan Yang;Sheng-Guo Wang;Dian Zhou;Xuan Zeng","doi":"10.1109/TCAD.2024.3443006","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443006","url":null,"abstract":"Exploring the design space of RISC-V processors faces significant challenges due to the vastness of the high-dimensional design space and the associated expensive simulation costs. This work proposes a region of interest (ROI)-driven method, which focuses on promising ROIs to reduce over-exploration of the huge design space and improve optimization efficiency. A tree structure based on self-organizing map (SOM) networks is proposed to partition the design space into ROIs. To reduce the high dimensionality of the design space, a variable selection technique based on a sensitivity matrix is developed to prune unimportant design parameters and efficiently hit the optimum inside the ROIs. Moreover, an asynchronous parallel strategy is employed to further reduce simulation time. Experimental results demonstrate the superiority of our proposed method, achieving improvements of up to 43.82% in performance, 33.20% in power consumption, and 11.41% in area compared to state-of-the-art methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4178-4189"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
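The sensitivity-based variable selection described above can be sketched with a one-at-a-time finite-difference estimate: perturb each design parameter, measure the change in the objective, and keep only the most influential parameters. This is a hedged illustration of the general technique, not the paper's actual sensitivity-matrix computation; the toy surrogate model and parameter names below are invented for the example.

```python
def sensitivities(objective, base, deltas):
    """One-at-a-time finite-difference sensitivity of `objective`
    with respect to each design parameter in `base`."""
    f0 = objective(base)
    sens = {}
    for name, delta in deltas.items():
        perturbed = dict(base)
        perturbed[name] += delta
        sens[name] = abs(objective(perturbed) - f0) / abs(delta)
    return sens

def prune(sens, keep):
    """Keep only the `keep` most influential parameters."""
    ranked = sorted(sens, key=sens.get, reverse=True)
    return ranked[:keep]

# Toy surrogate: latency dominated by cache size, barely affected by ROB depth.
def latency(params):
    return 100 - 5 * params["cache_kb"] + 0.01 * params["rob_depth"]

s = sensitivities(latency, {"cache_kb": 32, "rob_depth": 128},
                  {"cache_kb": 1, "rob_depth": 1})
important = prune(s, keep=1)  # -> ['cache_kb']
```

In a real flow the objective would be a cycle-accurate simulation, so each sensitivity probe is expensive; this is exactly why pruning unimportant dimensions early pays off.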
{"title":"Runtime Monitoring of ML-Based Scheduling Algorithms Toward Robust Domain-Specific SoCs","authors":"A. Alper Goksoy;Alish Kanani;Satrajit Chatterjee;Umit Ogras","doi":"10.1109/TCAD.2024.3445815","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3445815","url":null,"abstract":"Machine learning (ML) algorithms are being rapidly adopted to perform dynamic resource management tasks in heterogeneous systems-on-chip. For example, ML-based task schedulers can make quick, high-quality decisions at runtime. Like any ML model, these offline-trained policies depend critically on the representative power of the training data. Hence, their performance may diminish or even catastrophically fail under unknown workloads, especially new applications. This article proposes a novel framework that continuously monitors the system to detect unforeseen scenarios using a gradient-based generalization metric called coherence. The proposed framework accurately determines whether the current policy generalizes to new inputs. If not, it incrementally trains the ML scheduler to ensure the robustness of the task-scheduling decisions. The proposed framework is evaluated thoroughly with a domain-specific SoC and six real-world applications. It can detect whether the trained scheduler generalizes to the current workload with 88.75%–98.39% accuracy. Furthermore, it enables 1.1×–14× faster execution time when the scheduler is incrementally trained. Finally, overhead analysis performed on an Nvidia Jetson Xavier NX board shows that the proposed framework can run as a real-time background task.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4202-4213"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FDPUF: Frequency-Domain PUF for Robust Authentication of Edge Devices","authors":"Shubhra Deb Paul;Aritra Dasgupta;Swarup Bhunia","doi":"10.1109/TCAD.2024.3447211","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447211","url":null,"abstract":"Counterfeiting, overproduction, and cloning of integrated circuits (ICs) and associated hardware have emerged as major security concerns in the modern globalized microelectronics supply chain. One way to combat these issues effectively is to deploy hardware authentication techniques that utilize physical unclonable functions (PUFs). PUFs exploit intrinsic variations in hardware that occur during the manufacturing and fabrication process to generate device-specific fingerprints, i.e., immutable signatures that cannot be replicated by counterfeits and clones. However, unavoidable factors like environmental noise and harmonics can significantly deteriorate the quality of the PUF signature. Moreover, conventional PUF solutions are generally not amenable to in-field authentication of hardware, which has emerged as a critical need for Internet of Things (IoT) edge devices to detect physical attacks on them. In this article, we introduce the frequency-domain PUF (FDPUF), a novel PUF that analyzes time-domain current waveforms in the frequency domain to create high-quality authentication signatures suitable for in-field authentication. FDPUF decomposes electrical signals into their spectral coefficients, filters out unnecessary low-energy components, reconstructs the waveforms, and generates high-quality digital fingerprints for device authentication. Compared to existing authentication mechanisms, the higher quality of the signatures obtained through frequency-domain analysis makes the proposed FDPUF more suitable for protecting the integrity of edge computing hardware. We perform experimental measurements on FPGAs and analyze FDPUF properties using the National Institute of Standards and Technology (NIST) test suite to demonstrate that the FDPUF provides better uniqueness and robustness than its time-domain counterpart while being attractive for in-field authentication.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3479-3490"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
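The FDPUF pipeline described above (decompose a current waveform into spectral coefficients, discard low-energy components, derive a digital fingerprint) can be approximated in a few lines with a naive DFT. This is a toy sketch of the general idea, not the authors' signature-generation algorithm; a real implementation would also reconstruct and post-process the filtered waveform, and the bit-derivation rule here is invented for illustration.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform (O(n^2), stdlib only)."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]

def fingerprint(current_trace, keep=8):
    """Toy frequency-domain signature: rank the non-DC spectral
    coefficients by magnitude, drop the low-energy ones, and emit
    one bit per surviving coefficient from the sign of its real part."""
    spectrum = dft(current_trace)
    strongest = sorted(range(1, len(spectrum)),
                       key=lambda k: abs(spectrum[k]), reverse=True)[:keep]
    return [1 if spectrum[k].real >= 0 else 0 for k in sorted(strongest)]
```

Filtering in the frequency domain is what gives the scheme its noise robustness: small time-domain perturbations spread across many low-energy coefficients, which are exactly the ones discarded before the fingerprint is formed.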
{"title":"GOURD: Tensorizing Streaming Applications to Generate Multi-Instance Compute Platforms","authors":"Patrick Schmid;Paul Palomero Bernardo;Christoph Gerum;Oliver Bringmann","doi":"10.1109/TCAD.2024.3445810","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3445810","url":null,"abstract":"In this article, we raise the dataflow processing paradigm to a higher level of abstraction to automate the generation of multi-instance compute and memory platforms with interfaces to I/O devices (sensors and actuators). Since the different compute instances (NPUs, CPUs, DSPs, etc.) and I/O devices do not necessarily have compatible interfaces at the dataflow level, an automated translation is required. However, in multidimensional dataflow scenarios, it becomes inherently difficult to reason about buffer sizes and iteration order without knowing the shape of the data access pattern (DAP) that the dataflow follows. To capture this shape and the platform composition, we define a domain-specific representation (DSR) and devise a toolchain to generate a synthesizable platform, including appropriate streaming buffers for platform-specific tensorization of the data between incompatible interfaces. This allows platforms such as sensor edge AI devices to be specified by simply focusing on the shape of the data provided by the sensors and transmitted among compute units, giving the ability to evaluate and generate different dataflow design alternatives with significantly reduced design time.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4166-4177"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10745814","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DREAMx: A Data-Driven Error Estimation Methodology for Adders Composed of Cascaded Approximate Units","authors":"Muhammad Abdullah Hanif;Ayoub Arous;Muhammad Shafique","doi":"10.1109/TCAD.2024.3447209","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447209","url":null,"abstract":"Due to the significance and broad utilization of adders in computing systems, the design of low-power approximate adders (LPAAs) has received significant attention from the system design community. However, the selection and deployment of appropriate approximate modules require a thorough design space exploration, which is (in general) an extremely time-consuming process. To reduce the exploration time, different error estimation techniques have been proposed in the literature for evaluating the quality metrics of approximate adders. However, most of them are based on certain assumptions that limit their usability in real-world settings. In this work, we highlight the impact of those assumptions on the quality of error estimates provided by state-of-the-art techniques and show how they limit the use of such techniques in real-world settings. Moreover, we highlight the significance of considering input data characteristics for improving the quality of error estimation. Based on our analysis, we propose a systematic data-driven error estimation methodology, DREAMx, for adders composed of cascaded approximate units, which covers a predominant set of LPAAs. DREAMx in principle factors in the dependence between input bits, based on the given input distribution, to compute the probability mass function (PMF) of the error value at the output of an approximate adder. It achieves improved results compared to state-of-the-art techniques while offering a substantial decrease in overall execution (exploration) time compared to exhaustive simulations. Our results further show that there exists a delicate tradeoff between the achievable quality of error estimates and the overall execution time.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3348-3357"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
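For contrast with DREAMx's data-driven approach, the simpler baseline it improves upon (combining per-unit error PMFs of cascaded approximate units under an independence assumption) can be sketched as a discrete convolution. The unit PMFs below are made up for illustration, and the independence assumption is exactly the simplification DREAMx's input-dependence modeling is designed to remove.

```python
from collections import defaultdict

def convolve_pmfs(pmf_a, pmf_b):
    """Combine two error PMFs assuming the error sources are
    independent: P[E = e] = sum over e1 + e2 = e of P[e1] * P[e2]."""
    out = defaultdict(float)
    for e1, p1 in pmf_a.items():
        for e2, p2 in pmf_b.items():
            out[e1 + e2] += p1 * p2
    return dict(out)

def adder_error_pmf(unit_pmfs):
    """Error PMF of a cascade of approximate units. Each PMF maps an
    error value (already weighted by bit position) to its probability."""
    total = {0: 1.0}  # error-free before any unit is applied
    for pmf in unit_pmfs:
        total = convolve_pmfs(total, pmf)
    return total

# Two toy approximate units at bit positions 0 and 1.
unit0 = {0: 0.75, 1: 0.25}   # LSB unit errs by +1 a quarter of the time
unit1 = {0: 0.9, -2: 0.1}    # next unit errs by -2 occasionally
pmf = adder_error_pmf([unit0, unit1])
# pmf is approximately {0: 0.675, 1: 0.225, -2: 0.075, -1: 0.025}
```

A data-driven method like DREAMx would replace the product `p1 * p2` with probabilities conditioned on the actual input distribution, since carry chains make neighboring units' errors correlated in practice.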