{"title":"MII: A Multifaceted Framework for Intermittence-Aware Inference and Scheduling","authors":"Ziliang Zhang;Cong Liu;Hyoseung Kim","doi":"10.1109/TCAD.2024.3443710","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443710","url":null,"abstract":"The concurrent execution of deep neural networks (DNNs) inference tasks on the intermittently-powered batteryless devices (IPDs) has recently garnered much attention due to its potential in a broad range of smart sensing applications. While the checkpointing mechanisms (CMs) provided by the state-of-the-art make this possible, scheduling inference tasks on IPDs is still a complex problem due to significant performance variations across the DNN layers and CM choices. This complexity is further accentuated by dynamic environmental conditions and inherent resource constraints of IPDs. To tackle these challenges, we present MII, a framework designed for the intermittence-aware inference and scheduling on IPDs. MII formulates the shutdown and live time functions of an IPD from profiling the data, which our offline intermittence-aware search scheme uses to find the optimal layer-wise CMs for each task. At runtime, MII enhances the job success rates by dynamically making scheduling decisions to mitigate the workload losses from the power interruptions and adjusting these CMs in response to the actual energy patterns. Our evaluation demonstrates the superiority of MII over the state-of-the-art. In controlled environments, MII achieves an average increase of 21% and 39% in successful jobs under the stable and dynamic energy patterns. 
In the real-world settings, MII achieves 33% and 24% more successful jobs indoors and outdoors.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3708-3719"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HMC-FHE: A Heterogeneous Near Data Processing Framework for Homomorphic Encryption","authors":"Zehao Chen;Zhining Cao;Zhaoyan Shen;Lei Ju","doi":"10.1109/TCAD.2024.3447212","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447212","url":null,"abstract":"Fully homomorphic encryption (FHE) offers a promising solution to ensure data privacy by enabling computations directly on encrypted data. However, its notorious performance degradation severely limits the practical application, due to the explosion of both the ciphertext volume and computation. In this article, leveraging the diversity of computing power and memory bandwidth requirements of FHE operations, we present HMC-FHE, a robust acceleration framework that combines both GPU and hybrid memory cube (HMC) processing engines to accelerate FHE applications cooperatively. HMC-FHE incorporates four key hardware/software co-design techniques: 1) a fine-grained kernel offloading mechanism to efficiently offload FHE operations to relevant processing engines; 2) a ciphertext partitioning scheme to minimize data transfer across decentralized HMC processing engines; 3) an FHE operation pipeline scheme to facilitate pipelined execution between GPU and HMC engines; and 4) a kernel tuning scheme to guarantee the parallelism of GPU and HMC engines. We demonstrate that the GPU-HMC architecture with proper resource management serves as a promising acceleration scheme for memory-intensive FHE operations. 
Compared with the state-of-the-art GPU-based acceleration scheme, the proposed framework achieves up to \u0000<inline-formula> <tex-math>$2.65times $ </tex-math></inline-formula>\u0000 performance gains and reduces \u0000<inline-formula> <tex-math>$1.81times $ </tex-math></inline-formula>\u0000 energy consumption with the same peak computation capacity.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3551-3563"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Polynomial Neural Barrier Certificate Synthesis of Hybrid Systems via Counterexample Guidance","authors":"Hanrui Zhao;Banglong Liu;Lydia Dehbi;Huijiao Xie;Zhengfeng Yang;Haifeng Qian","doi":"10.1109/TCAD.2024.3447226","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447226","url":null,"abstract":"This article presents a novel approach to the safety verification of hybrid systems by synthesizing neural barrier certificates (BCs) via counterexample-guided neural network (NN) learning combined with sum-of-square (SOS)-based verification. We learn more easily verifiable BCs with NN polynomial expansions in a high-accuracy counterexamples guided framework. By leveraging the polynomial candidates yielded from the learning phase, we reformulate the identification of real BCs as convex linear matrix inequality (LMI) feasibility testing problems, instead of directly solving the inherently NP-hard nonconvex bilinear matrix inequality (BMI) problems associated with SOS-based BC generation. Furthermore, we decompose the large SOS verification programming into several manageable subprogrammings. Benefiting from the efficiency and scalability advantages, our approach can synthesize BCs not amenable to existing methods and handle more general hybrid systems.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3756-3767"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ML-Based Thermal and Cache Contention Alleviation on Clustered Manycores With 3-D HBM","authors":"Mohammed Bakr Sikal;Heba Khdr;Lokesh Siddhu;Jörg Henkel","doi":"10.1109/TCAD.2024.3438998","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3438998","url":null,"abstract":"Enabled by the recent advancements in 2.5D/3-D integration and packaging, the integration of clustered manycore processors with high-bandwidth memory (HBM) is gaining prominence to satisfy the increasing memory bandwidth demands. Although this integration can offer significant performance gains, it is still limited by cache contention in the final-level cache on the clusters and by the thermal issues in the 3-D HBM. While the existing state-of-the-art resource management techniques have tackled these issues in isolation, we argue that the cache contention and the temperature of both the manycore and the HBM must be considered jointly to harness the full performance potential of such modern architectures. To cover this gap in the literature, we present MTCM, the first resource management technique that considers the cache contention in maximizing the system performance, while maintaining the thermal safety across both the manycore and the HBM stack. Enabled by our accurate, yet lightweight, neural network models, our proposed task migration and dynamic voltage and frequency scaling policies can accurately predict the impact of runtime decisions on the performance and temperature of both the subsystems. 
Our extensive evaluation experiments reveal a significant performance improvement over existing state of the art by up to \u0000<inline-formula> <tex-math>$1times $ </tex-math></inline-formula>\u0000, while maintaining thermal safety of both the manycore and the HBM.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3614-3625"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Priority Scheduling of Multithreaded ROS 2 Executor With Shared Resources","authors":"Abdullah Al Arafat;Kurt Wilson;Kecheng Yang;Zhishan Guo","doi":"10.1109/TCAD.2024.3445259","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3445259","url":null,"abstract":"The second generation of robot operating system (ROS 2) received significant attention from the real-time system research community, mostly aiming at providing formal modeling and timing analysis. However, most of the current efforts are limited to the default scheduling design schemes of ROS 2. The unique scheduling policies maintained by default ROS 2 significantly affect the response time and acceptance rate of workload schedulability. It also invalidates the adaptation of the rich existing results related to nonpreemptive (and limited-preemptive) scheduling problems in the real-time systems community to ROS 2 schedulability analysis. This article aims to design, implement, and analyze a standard dynamic priority-based real-time scheduler for ROS 2 while handling shared resources. Specifically, we propose to replace the readySet with a readyQueue, which is much more efficient and comes with improvements for callback selection, queue updating, and a skipping scheme to avoid priority inversion from resource sharing. Such a novel ROS 2 executor design can also be used for efficient implementations of fixed priority policies and mixed-policy schedulers. Our modified executor maintains the compatibility with default ROS 2 architecture. We further identified and built a link between the scheduling of limited-preemption points tasks via the global earliest deadline first (GEDF) algorithm and ROS 2 processing chain scheduling without shared resources. Based on this, we formally capture the worst-case blocking time and thereby develop a response time analysis for ROS 2 processing chains with shared resources. 
We evaluate our scheduler by implementing our modified scheduler that accepts scheduling parameters from the system designer in ROS 2. We ran two case studies-one using real ROS 2 nodes to drive a small ground vehicle, and one using synthetic tasks. The second case study identifies a case where the modified executor prevents priority inversion. We also test our analysis with randomly generated workloads. In our tests, our modified scheduler performed better than the ROS 2 default. Our code is available online: \u0000<uri>https://github.com/RTIS-Lab/ROS-Dynamic-Executor</uri>\u0000.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3732-3743"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SENTINEL: Securing Indoor Localization Against Adversarial Attacks With Capsule Neural Networks","authors":"Danish Gufran;Pooja Anandathirtha;Sudeep Pasricha","doi":"10.1109/TCAD.2024.3446717","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3446717","url":null,"abstract":"With the increasing demand for edge device-powered location-based services in indoor environments, Wi-Fi received signal strength (RSS) fingerprinting has become popular, given the unavailability of GPS indoors. However, achieving robust and efficient indoor localization faces several challenges, due to RSS fluctuations from dynamic changes in indoor environments and heterogeneity of edge devices, leading to diminished localization accuracy. While advances in machine learning (ML) have shown promise in mitigating these phenomena, it remains an open problem. Additionally, emerging threats from adversarial attacks on ML-enhanced indoor localization systems, especially those introduced by malicious or rogue access points (APs), can deceive ML models to further increase localization errors. To address these challenges, we present SENTINEL, a novel embedded ML framework utilizing modified capsule neural networks to bolster the resilience of indoor localization solutions against adversarial attacks, device heterogeneity, and dynamic RSS fluctuations. We also introduce RSSRogueLoc, a novel dataset capturing the effects of rogue APs from several real-world indoor environments. Experimental evaluations demonstrate that SENTINEL achieves significant improvements, with up to \u0000<inline-formula> <tex-math>$3.5times $ </tex-math></inline-formula>\u0000 reduction in mean error and \u0000<inline-formula> <tex-math>$3.4times $ </tex-math></inline-formula>\u0000 reduction in worst-case error compared to state-of-the-art frameworks using simulated adversarial attacks. 
SENTINEL also achieves improvements of up to \u0000<inline-formula> <tex-math>$2.8times $ </tex-math></inline-formula>\u0000 in mean error and \u0000<inline-formula> <tex-math>$2.7times $ </tex-math></inline-formula>\u0000 in worst-case error compared to state-of-the-art frameworks when evaluated with the real-world RSSRogueLoc dataset.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4021-4032"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible Generation of Fast and Accurate Software Performance Simulators From Compact Processor Descriptions","authors":"Conrad Foik;Robert Kunzelmann;Daniel Mueller-Gritschneder;Ulf Schlichtmann","doi":"10.1109/TCAD.2024.3445255","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3445255","url":null,"abstract":"To find optimal solutions for modern embedded systems, designers frequently rely on the software performance simulators. These simulators combine an abstract functional description of a processor with a nonfunctional timing model to accurately estimate the processor’s timing while maintaining high simulation speeds. However, current performance simulators either inflexibly target specific processors or sacrifice accuracy or simulation speed. This article presents a new approach to the software performance simulation, combining flexibility with highly accurate estimates and high simulation speed. A code generator converts a compact structural description of the target processor’s pipeline into sets of timing constraints, describing the processor’s instruction execution. Based on these, it generates corresponding scheduling functions and timing variables, representing the availability of the modeled pipeline. The performance estimator uses these components to approximate the processor’s timing based on an instruction trace provided by an instruction set simulator. Results for the state-of-the-art CV32E40P and CVA6 RISC-V processors show an average relative error of 0.0015% and 3.88%, respectively, over a large set of benchmarks. 
Our approach reaches an average simulation speed of 24 and 15 million instructions per second (MIPS), respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4130-4141"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10745761","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Search-in-Memory: Reliable, Versatile, and Efficient Data Matching in SSD’s NAND Flash Memory Chip for Data Indexing Acceleration","authors":"Yun-Chih Chen;Yuan-Hao Chang;Tei-Wei Kuo","doi":"10.1109/TCAD.2024.3443702","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443702","url":null,"abstract":"To index the increasing volume of data, modern data indexes are typically stored on solid-state drives and cached in DRAM. However, searching such an index has resulted in significant I/O traffic due to limited access locality and inefficient cache utilization. At the heart of index searching is the operation of filtering through vast data spans to isolate a small, relevant subset, which involves basic equality tests rather than the complex arithmetic provided by modern CPUs. This article demonstrates the feasibility of performing data filtering directly within a NAND flash memory chip, transmitting only relevant search results rather than complete pages. Instead of adding complex circuits, we propose repurposing existing circuitry for efficient and accurate bitwise parallel matching. We demonstrate how different data structures can use our flexible SIMD command interface to offload index searches. This strategy not only frees up the CPU for more computationally demanding tasks, but it also optimizes DRAM usage for write buffering, significantly lowering energy consumption associated with I/O transmission between the CPU and DRAM. Extensive testing across a wide range of workloads reveals up to a \u0000<inline-formula> <tex-math>$9times $ </tex-math></inline-formula>\u0000 speedup in write-heavy workloads and up to 45% energy savings due to reduced read and write I/O. 
Furthermore, we achieve significant reductions in median and tail read latencies of up to 89% and 85%, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3864-3875"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlexBCM: Hybrid Block-Circulant Neural Network and Accelerator Co-Search on FPGAs","authors":"Wenqi Lou;Yunji Qin;Xuan Wang;Lei Gong;Chao Wang;Xuehai Zhou","doi":"10.1109/TCAD.2024.3439488","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3439488","url":null,"abstract":"Block-circulant matrix (BCM) compression has garnered much attention in the hardware acceleration of convolutional neural networks (CNNs) due to its regularity and efficiency. However, constrained by the difficulty of exploring the compression parameter space, existing BCM-based methods often apply a uniform compression parameter to all CNN models’ layers, losing the compression’s flexibility. Additionally, independently optimizing models or accelerators makes achieving the optimal tradeoff between model accuracy and hardware efficiency challenging. To this end, we propose FlexBCM, a joint exploration framework that efficiently explores both the parameter compression and hardware parameter space to generate customized hybrid BCM-compressed CNN and field-programmable gate array (FPGA) accelerator solutions. On the algorithmic side, leveraging the idea of neural architecture search (NAS), we design an efficient differentiable sampling method to rapidly evaluate the accuracy of candidate subnets. Additionally, we devise a hardware-friendly frequency domain quantization scheme for BCM computation. On the hardware side, we develop the efficient and parameter-configurable convolutional core (ConvPU) alongside the BCM computing core (BCMPU). The BCMPU can flexibly accommodate different compression parameters at runtime, incorporate complex-number DSP packing and conjugate symmetry optimizations. For model-to-hardware evaluation, we construct accurate latency and resource consumption models. Moreover, we design a fast hardware generation algorithm based on the coarse-grained search to provide prompt feedback on the hardware evaluation of the current subnet. 
Finally, we validate FlexBCM on the Xilinx ZCU102 FPGA and compare its compressed CNN-accelerator solutions with previous state-of-the-art works. Experimental results demonstrate that FlexBCM achieves 1.21–3.02 times higher-computational efficiency for ResNet18 and ResNet34 models while maintaining an acceptable accuracy loss on the ImageNet dataset.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3852-3863"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EASTER: Learning to Split Transformers at the Edge Robustly","authors":"Xiaotian Guo;Quan Jiang;Yixian Shen;Andy D. Pimentel;Todor Stefanov","doi":"10.1109/TCAD.2024.3438995","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3438995","url":null,"abstract":"Prevalent large transformer models present significant computational challenges for resource-constrained devices at the Edge. While distributing the workload of deep learning models across multiple edge devices has been extensively studied, these works typically overlook the impact of failures of edge devices. Unpredictable failures, due to, e.g., connectivity issues or discharged batteries, can compromise the reliability of inference serving at the Edge. In this article, we introduce a novel methodology, called EASTER, designed to learn robust distribution strategies for transformer models against device failures that consider the tradeoff between robustness (i.e., maintaining model functionality against failures) and resource utilization (considering memory usage and computations). We evaluate EASTER with three representative transformers—ViT, GPT-2, and Vicuna—under device failures. 
Our results demonstrate EASTER’s efficiency in memory usage, and possible end-to-end latency improvement for inference across multiple edge devices while preserving model accuracy as much as possible under device failures.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3626-3637"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}