Zhiqiang Que, Hongxiang Fan, Marcus Loo, He Li, Michaela Blott, Maurizio Pierini, Alexander Tapper, Wayne Luk
{"title":"LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics","authors":"Zhiqiang Que, Hongxiang Fan, Marcus Loo, He Li, Michaela Blott, Maurizio Pierini, Alexander Tapper, Wayne Luk","doi":"10.1145/3640464","DOIUrl":"https://doi.org/10.1145/3640464","url":null,"abstract":"<p>This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors, delivering unprecedented low latency performance. Incorporating FPGA-based GNNs into particle detectors presents a unique challenge since it requires sub-microsecond latency to deploy the networks for online event selection with a data rate of hundreds of terabytes per second in the Level-1 triggers at the CERN Large Hadron Collider experiments. This paper proposes a novel outer-product based matrix multiplication approach, which is enhanced by exploiting the structured adjacency matrix and a column-major data layout. In addition, we propose a custom code transformation for the matrix multiplication operations, which leverages the structured sparsity patterns and binary features of adjacency matrices to reduce latency and improve hardware efficiency. Moreover, a fusion step is introduced to further reduce the end-to-end design latency by eliminating unnecessary boundaries. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with a much better latency but also finds a high accuracy design under given latency constraints. To facilitate this, a customizable template for this low latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 9.0 times faster and achieves up to 13.1 times higher power efficiency than a GPU implementation. Compared to the previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy. The proposed LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"5 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139474911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Casini, Dakshina Dasari, Matthias Becker, Giorgio Buttazzo
{"title":"Introduction to the Special Issue on Real-Time Computing in the IoT-to-Edge-to-Cloud Continuum","authors":"Daniel Casini, Dakshina Dasari, Matthias Becker, Giorgio Buttazzo","doi":"10.1145/3605180","DOIUrl":"https://doi.org/10.1145/3605180","url":null,"abstract":"<p>No abstract available.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"29 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139409718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Securing Pacemakers using Runtime Monitors over Physiological Signals","authors":"Abhinandan Panda, Srinivas Pinisetty, Partha Roop","doi":"10.1145/3638286","DOIUrl":"https://doi.org/10.1145/3638286","url":null,"abstract":"<p>Wearable and implantable medical devices (IMDs) are increasingly deployed to diagnose, monitor, and provide therapy for critical medical conditions. Such medical devices are safety-critical cyber-physical systems (CPSs). These systems support wireless features introducing potential security vulnerabilities. Although these devices undergo rigorous safety certification processes, runtime security attacks are inevitable. Based on published literature, IMDs such as pacemakers and insulin infusion systems can be remotely controlled to inject deadly electric shocks and excess insulin, posing a threat to a patient’s life. While prior works based on formal methods have been proposed to detect potential attack vectors using different forms of static analysis, these have limitations in preventing attacks at runtime. </p><p>This paper discusses a formal framework for detecting cyber-physical attacks on a pacemaker by monitoring its security policies at runtime. We propose a wearable device that senses the Electrocardiogram (ECG) and Photoplethysmogram (PPG) of the body to detect attacks in a pacemaker. To facilitate the design of this device, we map the security policies of a pacemaker w.r.t ECG and PPG, paving the way for designing formal verification monitors for pacemakers for the first time using multiple physiological signals. The proposed monitoring framework allows the synthesis of parallel monitors from a given set of desired security policies, where all the monitors execute concurrently and generate an alarm to the user in the case of policy violation. Our implementation and the performance evaluation results demonstrate the technical feasibility of designing such a wearable device for attack detection in pacemakers. This device is separate from the pacemaker, ensuring no need for re-certification of pacemakers. Our approach is amenable to the application of security patches when new attack vectors are detected, making the approach ideal for runtime monitoring of medical CPSs.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"44 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139375901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Design Flow for Scheduling Spiking Deep Convolutional Neural Networks on Heterogeneous Neuromorphic System-on-Chip","authors":"Anup Das","doi":"10.1145/3635032","DOIUrl":"https://doi.org/10.1145/3635032","url":null,"abstract":"<p>Neuromorphic systems-on-chip (NSoCs) integrate CPU cores and neuromorphic hardware accelerators on the same chip. These platforms can execute spiking deep convolutional neural networks (SDCNNs) with a low energy footprint. Modern NSoCs are heterogeneous in terms of their computing, communication, and storage resources. This makes scheduling SDCNN operations a combinatorial problem of exploring an exponentially-large state space in determining mapping, ordering, and timing of operations to achieve a target hardware performance, e.g., throughput. </p><p>We propose a systematic design flow to schedule SDCNNs on an NSoC. Our scheduler, called SMART (<underline>S</underline>DCNN <underline>MA</underline>pping, Orde<underline>R</underline>ing, and <underline>T</underline>iming), branches the combinatorial optimization problem into computationally-relaxed sub-problems that generate fast solutions without significantly compromising the solution quality. SMART improves performance by efficiently incorporating the heterogeneity in computing, communication, and storage resources. SMART operates in four steps. First, it creates a self-timed execution schedule to map operations to compute resources, maximizing throughput. Second, it uses an optimization strategy to distribute activation and synaptic weights to storage resources, minimizing data communication-related overhead. Third, it constructs an inter-processor communication (IPC) graph with a transaction order for its communication actors. This transaction order is created using a transaction partial order algorithm, which minimizes contention on the shared communication resources. Finally, it schedules this IPC graph to hardware by overlapping communication with the computation, and leveraging operation, pipeline, and batch parallelism. </p><p>We evaluate SMART using 10 representative image, object, and language-based SDCNNs. Results show that SMART increases throughput by an average 23%, compared to a state-of-the-art scheduler. SMART is implemented entirely in software as a compiler extension. It doesn’t require any change in a neuromorphic hardware or its interface to CPUs. It improves throughput with only a marginal increase in the compilation time. SMART is released under the open-source MIT licensing at https://github.com/drexel-DISCO/SMART\u0000to foster future research.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"47 6","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration","authors":"Huamei Qi, Fang Ren, Leilei Wang, Ping Jiang, Shaohua Wan, Xiaoheng Deng","doi":"10.1145/3634704","DOIUrl":"https://doi.org/10.1145/3634704","url":null,"abstract":"<p>Edge intelligence has emerged as a promising paradigm to accelerate DNN inference by model partitioning, which is particularly useful for intelligent scenarios that demand high accuracy and low latency. However, the dynamic nature of the edge environment and the diversity of end devices pose a significant challenge for DNN model partitioning strategies. Meanwhile, limited resources of edge server make it difficult to manage resource allocation efficiently among multiple devices. In addition, most of the existing studies disregard the different service requirements of the DNN inference tasks, such as its high accuracy-sensitive or high latency-sensitive. To address these challenges, we propose a Multi-Compression Scale DNN Inference Acceleration (MCIA) based on cloud-edge-end collaboration. We model this problem as a mixed-integer multi-dimensional optimization problem, jointly optimizing the DNN model version choice, the partitioning choice, and the allocation of computational and bandwidth resources to maximize the tradeoff between inference accuracy and latency depending on the property of the tasks. Initially, we train multiple versions of DNN inference models with different compression scales in the cloud, and deploy them to end devices and edge server. Next, a deep reinforcement learning-based algorithm is developed for joint decision making of adaptive collaborative inference and resource allocation based on the current multi-compression scale models and the task property. Experimental results show that MCIA can adapt to heterogeneous devices and dynamic networks, and has superior performance compared with other methods.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"50 8","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling and Analysis of ETC Control System with Colored Petri Net and Dynamic Slicing","authors":"Wangyang Yu, Jinming Kong, Zhijun Ding, Xiaojun Zhai, Zhiqiang Li, Qi Guo","doi":"10.1145/3633450","DOIUrl":"https://doi.org/10.1145/3633450","url":null,"abstract":"<p>Nowadays, an Electronic Toll Collection (ETC) control system in highways has been widely adopted to smoothen traffic flow. However, as it is a complex business interaction system, there are inevitably flaws in its control logic process, such as the problem of vehicle fee evasion. Even we find that there are more than one way for vehicles to evade fees. This shows that it is difficult to ensure the completeness of its design. Therefore, it is necessary to adopt a novel formal method to model and analyze its design, detect flaws and modify it. In this paper, a Colored Petri net (CPN) is introduced to establish its model. To analyze and modify the system model more efficiently, a dynamic slicing method of CPN is proposed. First, a static slice is obtained from the static slicing criterion by backtracking. Second, considering all binding elements that can be enabled under the initial marking, a forward slice is obtained from the dynamic slicing criterion by traversing. Third, the dynamic slicing of CPN is obtained by taking the intersection of both slices. The proposed dynamic slicing method of CPN can be used to formalize and verify the behavior properties of an ETC control system, and the flaws can be detected effectively. As a case study, the flaw about a vehicle that has not completed the payment following the previous vehicle to pass the railing is detected by the proposed method.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"50 4","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Virtual Environment Model Generation for CPS Goal Verification using Imitation Learning","authors":"Yong-Jun Shin, Donghwan Shin, Doo-Hwan Bae","doi":"10.1145/3633804","DOIUrl":"https://doi.org/10.1145/3633804","url":null,"abstract":"<p>Cyber-Physical Systems (CPS) continuously interact with their physical environments through embedded software controllers that observe the environments and determine actions. Field Operational Tests (FOT) are essential to verify to what extent the CPS under analysis can achieve certain CPS goals, such as satisfying the safety and performance requirements, while interacting with the real operational environment. However, performing many FOTs to obtain statistically significant verification results is challenging due to its high cost and risk in practice. Simulation-based verification can be an alternative to address the challenge, but it still requires an accurate virtual environment model that can replace the real environment interacting with the CPS in a closed loop. </p><p>In this paper, we propose ENVI (ENVironment Imitation), a novel approach to automatically generate an accurate virtual environment model, enabling efficient and accurate simulation-based CPS goal verification in practice.To do this, we first formally define the problem of the virtual environment model generation and solve it by leveraging Imitation Learning (IL), which has been actively studied in machine learning to learn complex behaviors from expert demonstrations. The key idea behind the model generation is to leverage IL for training a model that imitates the interactions between the CPS controller and its real environment as recorded in (possibly very small) FOT logs. We then statistically verify the goal achievement of the CPS by simulating it with the generated model. We empirically evaluate ENVI by applying it to the verification of two popular autonomous driving assistant systems. The results show that ENVI can reduce the cost of CPS goal verification while maintaining its accuracy by generating accurate environment models from only a few FOT logs. The use of IL in virtual environment model generation opens new research directions, further discussed at the end of the paper.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"49 2","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IoV-Fog-Assisted Framework for Accident Detection and Classification","authors":"Navin Kumar, Sandeep Kumar Sood, Munish Saini","doi":"10.1145/3633805","DOIUrl":"https://doi.org/10.1145/3633805","url":null,"abstract":"<p>The evolution of vehicular research into an effectuating area like the Internet of Vehicles (IoV) was verified by technical developments in hardware. The integration of the Internet of Things (IoT) and Vehicular Ad-hoc Networks (VANET) has significantly impacted addressing various problems, from dangerous situations to finding practical solutions. During a catastrophic collision, the vehicle experiences extreme turbulence, which may be captured using Micro-Electromechanical systems (MEMS) to yield signatures characterizing the severity of the accident. This study presents a three-layer design, with the data collecting layer relying on a low-power IoT configuration that includes GPS and an MPU 6050 placed on an Arduino Mega. The fog layer oversees data pre-processing and other low-level computing operations. With its extensive computing capabilities, the farthest cloud layer carries out Multidimensional Dynamic Time Warping (MDTW) to identify accidents and maintains the information repository by updating it. The experimentation compared the state-of-the-art algorithms such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest Tree (RFT) using threshold-based detection with the proposed MDTW clustering approach. Data collection involves simulating accidents via VirtualCrash for training and testing, whereas the IoV circuitry would be utilized in actual real-life scenarios. The proposed approach achieved an F1-Score of 0.8921 and 0.8184 for rear and head-on collisions.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"50 5","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COBRRA: COntention aware cache Bypass with Request-Response Arbitration","authors":"Aritra Bagchi, Dinesh Joshi, Preeti Ranjan Panda","doi":"10.1145/3632748","DOIUrl":"https://doi.org/10.1145/3632748","url":null,"abstract":"<p>In modern multi-processor systems-on-chip (MPSoCs), requests from different processor cores, accelerators, and their responses from the lower level memory contend for the shared cache bandwidth, making it a critical performance bottleneck. Prior research on shared cache management has considered requests from cores, but has ignored crucial contributions from their responses. Prior cache bypass techniques focused on data reuse and neglected the system-level implications of shared cache contention. We propose COBRRA, a novel shared cache controller policy that mitigates the contention by aggressively bypassing selected responses from the lower level memory, and scheduling the remaining requests and responses to the cache efficiently. COBRRA is able to improve the average performance of a set of 15 SPEC workloads by (49% ) and (33% ) compared to the no-bypass baseline and the best performing state-of-the-art bypass solution, respectively. Furthermore, COBRRA reduces the overall cache energy consumption by (38% ) and (31% ) compared to the no-bypass baseline and the most energy-efficient state-of-the-art bypass solution, respectively.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"49 8","pages":""},"PeriodicalIF":2.0,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Aware Adaptive Mixed-Criticality Scheduling with Semi-Clairvoyance and Graceful Degradation","authors":"Yi-Wen Zhang, Hui Zheng, Zonghua Gu","doi":"10.1145/3632749","DOIUrl":"https://doi.org/10.1145/3632749","url":null,"abstract":"The classic Mixed-Criticality System (MCS) task model is a non-clairvoyance model in which the change of the system behavior is based on the completion of high-criticality tasks while dropping low-criticality tasks in high-criticality mode. In this paper, we simultaneously consider graceful degradation and semi-clairvoyance in MCS. We first propose the analysis for adaptive mixed-criticality with semi-clairvoyance denoted as C-AMC-sem. The so-called semi-clairvoyance refers to the system’s behavior change being revealed at the time that jobs are released. Moreover, we propose a new algorithm based on C-AMC-sem to reduce energy consumption. Finally, we verify the performance of the proposed algorithms via experiments upon synthetically generated tasksets. The experimental results indicate that the proposed algorithms significantly outperform the existing algorithms.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"46 20","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136347911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}