{"title":"KPAC: Efficient Emulation of the ARM Pointer Authentication Instructions","authors":"Illia Ostapyshyn;Gabriele Serra;Tim-Marek Thomas;Daniel Lohmann","doi":"10.1109/TCAD.2024.3443773","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443773","url":null,"abstract":"ARMv8.3-A has introduced the pointer authentication (PA) feature, a new set of measures and instructions to sign and validate pointers. PA is already used and supported by the major compilers to protect the return addresses on the stack as a measure against memory corruption attacks. As more and more SoCs implement ARMv8.3-A and code compiled with PA is even fully backwards compatible on CPUs without (where the new instructions are just ignored), we can expect PA-enabled binaries to become standard in the near future. This gives rise to the question, if and how also systems without the native PA could benefit from the extra security provided by the return address protection. In this article, we explore KPAC, a set of efficient software-based approaches to bring the PA-based return-address protection onto the platforms without the hardware support in an easily adoptable (binary-compatible) and scalable manner. Technically, KPAC achieves this by either a synchronous trap-based emulation inside the kernel or an asynchronous novel memory-based invocation of a dedicated CPU core. Our experiments with the CortexSuite benchmarks, Chromium, and Memcached on a variety of platforms running Linux ranging from a Xilinx ZCU102 board over a Raspberry Pi 4 up to an 80-core Ampere Altra demonstrate the broad applicability and scalability of our approach. 
Furthermore, we discuss how the principles of KPAC can be generalized to the other suited problem areas.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3467-3478"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimode Security-Aware Real-Time Scheduling on Multiprocessors","authors":"Jiankang Ren;Chunxiao Liu;Chi Lin;Wei Jiang;Pengfei Wang;Xiangwei Qi;Simeng Li;Shengyu Li","doi":"10.1109/TCAD.2024.3445260","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3445260","url":null,"abstract":"Embedded real-time systems generally execute in a predictable and deterministic manner to deliver critical functionality within stringent timing constraints. However, the predictable execution behavior leaves the system vulnerable to schedule-based attacks. In this article, we present a multimode security-aware real-time scheduling scheme to counteract schedule-based attacks on multiprocessor real-time systems. To mitigate the vulnerability to the schedule-based attack, we propose a multimode scheduling method to reduce the accumulative attack effective window (AEW) of multiple victim tasks and prevent the untrusted tasks from executing during the AEW by distinctively scheduling mixed-trust tasks according to the system mode. To avoid the protection degradation due to the excessive blocking of untrusted tasks, we introduce a protection window for multiple victims on multiprocessors by analyzing the system protection capability limit under the system schedulability constraint. Furthermore, to maximize the protection capability of the multimode security-aware scheduling strategy on a multiprocessor platform, we also propose a security-aware packing algorithm to balance the workloads of mixed-trust tasks on different processors using a mixed-trust worst-fit decreasing heuristic strategy. The experimental results demonstrate that our proposed approach significantly outperforms the state-of-the-art method. 
Specifically, the AEW ratio and the AEW untrusted execution time ratio are reduced by 18.8% and 62.8%, respectively, while the defense success rate against ScheduLeak attack is improved by 16.3%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3407-3418"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AxOSpike: Spiking Neural Networks-Driven Approximate Operator Design","authors":"Salim Ullah;Siva Satyendra Sahoo;Akash Kumar","doi":"10.1109/TCAD.2024.3443000","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443000","url":null,"abstract":"Approximate computing (AxC) is being widely researched as a viable approach to deploying compute-intensive artificial intelligence (AI) applications on resource-constrained embedded systems. In general, AxC aims to provide disproportionate gains in system-level power-performance-area (PPA) by leveraging the implicit error tolerance of an application. One of the more widely used methods in AxC involves circuit pruning of arithmetic operators used to process AI workloads. However, most related works adopt an application-agnostic approach to operator modeling for the design space exploration (DSE) of Approximate Operators (AxOs). To this end, we propose an application-driven approach to designing AxOs. Specifically, we use spiking neural network (SNN)-based inference to present an application-driven operator model resulting in AxOs with better-PPA-accuracy tradeoffs compared to traditional circuit pruning. Additionally, we present a novel FPGA-specific operator model to improve the quality of AxOs that can be obtained using circuit pruning. With the proposed methods, we report designs with up to 26.5% lower PDPxLUTs with similar application-level accuracy. 
Further, we report a considerably better set of design points than related works with up to 51% better-Pareto front hypervolume.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3324-3335"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VALO: A Versatile Anytime Framework for LiDAR-Based Object Detection Deep Neural Networks","authors":"Ahmet Soyyigit;Shuochao Yao;Heechul Yun","doi":"10.1109/TCAD.2024.3443774","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443774","url":null,"abstract":"This work addresses the challenge of adapting dynamic deadline requirements for the LiDAR object detection deep neural networks (DNNs). The computing latency of object detection is critically important to ensure safe and efficient navigation. However, the state-of-the-art LiDAR object detection DNNs often exhibit significant latency, hindering their real-time performance on the resource-constrained edge platforms. Therefore, a tradeoff between the detection accuracy and latency should be dynamically managed at runtime to achieve the optimum results. In this article, we introduce versatile anytime algorithm for the LiDAR Object detection (VALO), a novel data-centric approach that enables anytime computing of 3-D LiDAR object detection DNNs. VALO employs a deadline-aware scheduler to selectively process the input regions, making execution time and accuracy tradeoffs without architectural modifications. Additionally, it leverages efficient forecasting of the past detection results to mitigate possible loss of accuracy due to partial processing of input. Finally, it utilizes a novel input reduction technique within its detection heads to significantly accelerate the execution without sacrificing accuracy. We implement VALO on the state-of-the-art 3-D LiDAR object detection networks, namely CenterPoint and VoxelNext, and demonstrate its dynamic adaptability to a wide range of time constraints while achieving higher accuracy than the prior state-of-the-art. 
Code is available at https://github.com/CSL-KU/VALO.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4045-4056"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture","authors":"Peiyan Dong;Jinming Zhuang;Zhuoping Yang;Shixin Ji;Yanyu Li;Dongkuan Xu;Heng Huang;Jingtong Hu;Alex K. Jones;Yiyu Shi;Yanzhi Wang;Peipei Zhou","doi":"10.1109/TCAD.2024.3443692","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443692","url":null,"abstract":"While vision transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (<1> $13.1\\times$ over computing solutions of Intel Xeon 8375C vCPU, Nvidia A10G, A100, Jetson AGX Orin GPUs, AMD ZCU102, and U250 FPGAs. The energy efficiency gains are 62.2, 15.33, 12.82, 13.31, 13.5, and $21.9\\times$.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3949-3960"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NOBtree: A NUMA-Optimized Tree Index for Nonvolatile Memory","authors":"Zhaole Chu;Peiquan Jin;Yongping Luo;Xiaoliang Wang;Shouhong Wan","doi":"10.1109/TCAD.2024.3438111","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3438111","url":null,"abstract":"Nonvolatile memory (NVM) suffers from more serious nonuniform memory access (NUMA) effects than DRAM because of the lower bandwidth and higher latency. While numerous works have aimed at optimizing NVM indexes, only a few of them tried to address the NUMA impact. Existing approaches mainly rely on local NVM write buffers or DRAM-based read buffers to mitigate the cost of remote NVM access, which introduces memory overhead and causes performance degradation for lookup and scan operations. In this article, we present NOBtree, a new NUMA-optimized persistent tree index. The novelty of NOBtree is two-fold. First, NOBtree presents per-NUMA replication and an efficient node-migration mechanism to reduce remote NVM access. Second, NOBtree proposes a NUMA-aware NVM allocator to improve the insert performance and scalability. We conducted experiments on six workloads to evaluate the performance of NOBtree. The results show that NOBtree can effectively reduce the number of remote NVM accesses. 
Moreover, NOBtree outperforms existing persistent indexes, including TLBtree, Fast&Fair, ROART, and PACtree, by up to $3.23\\times$ in throughput and $4.07\\times$ in latency.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3840-3851"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arch2End: Two-Stage Unified System-Level Modeling for Heterogeneous Intelligent Devices","authors":"Weihong Liu;Zongwei Zhu;Boyu Li;Yi Xiong;Zirui Lian;Jiawei Geng;Xuehai Zhou","doi":"10.1109/TCAD.2024.3443706","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443706","url":null,"abstract":"The surge in intelligent edge computing has propelled the adoption and expansion of the distributed embedded systems (DESs). Numerous scheduling strategies are introduced to improve the DES throughput, such as latency-aware and group-based hierarchical scheduling. Effective device modeling can help in modular and plug-in scheduler design. For uniformity in scheduling interfaces, a unified device performance modeling is adopted, typically involving the system-level modeling that incorporates both the hardware and software stacks, broadly divided into two categories. Fine-grained modeling methods based on the hardware architecture analysis become very difficult when dealing with a large number of heterogeneous devices, mainly because much architecture information is closed-source and costly to analyse. Coarse-grained methods are based on the limited architecture information or benchmark models, resulting in insufficient generalization in the complex inference performance of diverse deep neural networks (DNNs). Therefore, we introduce a two-stage system-level modeling method (Arch2End), combining limited architecture information with scalable benchmark models to achieve a unified performance representation. Stage one leverages public information to analyse architectures in a uniform abstraction and to design the benchmark models for exploring the device performance boundaries, ensuring uniformity. Stage two extracts critical device features from the end-to-end inference metrics of extensive simulation models, ensuring universality and enhancing characterization capacity. 
Compared to the state-of-the-art methods, Arch2End achieves the lowest DNN latency prediction relative errors in the NAS-Bench-201 (1.7%) and real-world DNNs (8.2%). It also showcases superior performance in intergroup balanced device grouping strategies.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4154-4165"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ARTEMIS: A Mixed Analog-Stochastic In-DRAM Accelerator for Transformer Neural Networks","authors":"Salma Afifi;Ishan Thakkar;Sudeep Pasricha","doi":"10.1109/TCAD.2024.3446719","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3446719","url":null,"abstract":"Transformers have emerged as a powerful tool for natural language processing (NLP) and computer vision. Through the attention mechanism, these models have exhibited remarkable performance gains when compared to conventional approaches like recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Nevertheless, transformers typically demand substantial execution time due to their extensive computations and large memory footprint. Processing in-memory (PIM) and near-memory computing (NMC) are promising solutions to accelerating transformers as they offer high-compute parallelism and memory bandwidth. However, designing PIM/NMC architectures to support the complex operations and massive amounts of data that need to be moved between layers in transformer neural networks remains a challenge. We propose ARTEMIS, a mixed analog-stochastic in-DRAM accelerator for transformer models. Through employing minimal changes to the conventional DRAM arrays, ARTEMIS efficiently alleviates the costs associated with transformer model execution by supporting stochastic computing for multiplications and temporal analog accumulations using a novel in-DRAM metal-on-metal capacitor. 
Our analysis indicates that ARTEMIS exhibits at least $3.0\\times$ speedup, and $1.8\\times$ lower energy compared to GPU, TPU, CPU, and state-of-the-art PIM transformer hardware accelerators.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3336-3347"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent RAGE: Randomness Assessment Using Generative Entropy Models","authors":"Kuheli Pratihar;Rajat Subhra Chakraborty;Debdeep Mukhopadhyay","doi":"10.1109/TCAD.2024.3449562","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3449562","url":null,"abstract":"NIST’s recent review of the widely employed special publication (SP) 800–22 randomness testing suite has underscored several shortcomings, particularly the absence of entropy source modeling and the necessity for large sequence lengths. Motivated by this revelation, we explore low-dimensional modeling of the entropy source in random number generators (RNGs) using a variational autoencoder (VAE). This low-dimensional modeling enables the separation between strong and weak entropy sources by magnifying the deterministic effects in the latter, which are otherwise difficult to detect with conventional testing. Bits from weak-entropy RNGs with bias, correlation, or deterministic patterns are more likely to lie on a low-dimensional manifold within a high-dimensional space, in contrast to strong-entropy RNGs, such as true RNGs (TRNGs) and pseudo-RNGs (PRNGs) with uniformly distributed bits. We exploit this insight to employ a generative AI-based noninterference test (GeNI) for the first time, achieving implementation-agnostic low-dimensional modeling of all types of entropy sources. GeNI’s generative aspect uses VAEs to produce synthetic bitstreams from the latent representation of RNGs, which are subjected to a deep learning (DL)-based noninterference (NI) test evaluating the masking ability of the synthetic bitstreams. The core principle of the NI test is that if the bitstream exhibits high-quality randomness, the masked data from the two sources should be indistinguishable. 
GeNI facilitates a comparative analysis of low-dimensional entropy source representations across various RNGs, adeptly identifying the artificial randomness in specious RNGs with deterministic patterns that otherwise passes all NIST SP800-22 tests. Notably, GeNI achieves this with $10\\times$ lower-sequence lengths and $16.5\\times$ faster execution time compared to the NIST test suite.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3503-3514"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROI-HIT: Region of Interest-Driven High-Dimensional Microarchitecture Design Space Exploration","authors":"Xuyang Zhao;Tianning Gao;Aidong Zhao;Zhaori Bi;Changhao Yan;Fan Yang;Sheng-Guo Wang;Dian Zhou;Xuan Zeng","doi":"10.1109/TCAD.2024.3443006","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443006","url":null,"abstract":"Exploring the design space of RISC-V processors faces significant challenges due to the vastness of the high-dimensional design space and the associated expensive simulation costs. This work proposes a region of interest (ROI)-driven method, which focuses on the promising ROIs to reduce the over-exploration on the huge design space and improve the optimization efficiency. A tree structure based on self-organizing map (SOM) networks is proposed to partition the design space into ROIs. To reduce the high dimensionality of design space, a variable selection technique based on a sensitivity matrix is developed to prune unimportant design parameters and efficiently hit the optimum inside the ROIs. Moreover, an asynchronous parallel strategy is employed to further save the time taken by simulations. Experimental results demonstrate the superiority of our proposed method, achieving improvements of up to 43.82% in performance, 33.20% in power consumption, and 11.41% in area compared to state-of-the-art methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4178-4189"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}