{"title":"ViT4Mal: Lightweight Vision Transformer for Malware Detection on Edge Devices","authors":"Akshara Ravi, Vivek Chaturvedi, Muhammad Shafique","doi":"10.1145/3609112","DOIUrl":"https://doi.org/10.1145/3609112","url":null,"abstract":"There has been a tremendous growth of edge devices connected to the network in recent years. Although these devices make our life simpler and smarter, they need to perform computations under severe resource and energy constraints, while being vulnerable to malware attacks. Once compromised, these devices are further exploited as attack vectors targeting critical infrastructure. Most existing malware detection solutions are resource and compute-intensive and hence perform poorly in protecting edge devices. In this paper, we propose a novel approach ViT4Mal that utilizes a lightweight vision transformer (ViT) for malware detection on an edge device. ViT4Mal first converts executable byte-code into images to learn malware features and later uses a customized lightweight ViT to detect malware with high accuracy. We have performed extensive experiments to compare our model with state-of-the-art CNNs in the malware detection domain. Experimental results corroborate that ViTs don’t demand deeper networks to achieve comparable accuracy of around 97% corresponding to heavily structured CNN models. We have also performed hardware deployment of our proposed lightweight ViT4Mal model on the Xilinx PYNQ Z1 FPGA board by applying specialized hardware optimizations such as quantization, loop pipelining, and array partitioning. ViT4Mal achieved an accuracy of ~94% and a 41x speedup compared to the original ViT model.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CABARRE: Request Response Arbitration for Shared Cache Management","authors":"Garima Modi, Aritra Bagchi, Neetu Jindal, Ayan Mandal, Preeti Ranjan Panda","doi":"10.1145/3608096","DOIUrl":"https://doi.org/10.1145/3608096","url":null,"abstract":"Modern multi-processor systems-on-chip (MPSoCs) are characterized by caches shared by multiple cores. These shared caches receive requests issued by the processor cores. Requests that are subject to cache misses may result in the generation of responses . These responses are received from the lower level of the memory hierarchy and written to the cache. The outstanding requests and responses contend for the shared cache bandwidth. To mitigate the impact of the cache bandwidth contention on the overall system performance, an efficient request and response arbitration policy is needed. Research on shared cache management has neglected the additional cache contention caused by responses, which are written to the cache. We propose CABARRE , a novel request and response arbitration policy at shared caches, so as to improve the overall system performance. CABARRE shows a performance improvement of 23% on average across a set of SPEC workloads compared to straightforward adaptations of state-of-the-art solutions.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"<scp>Hephaestus</scp> : Codesigning and Automating 3D Image Registration on Reconfigurable Architectures","authors":"Giuseppe Sorrentino, Marco Venere, Davide Conficconi, Eleonora D’Arnese, Marco Domenico Santambrogio","doi":"10.1145/3607928","DOIUrl":"https://doi.org/10.1145/3607928","url":null,"abstract":"Healthcare is a pivotal research field, and medical imaging is crucial in many applications. Therefore finding new architectural and algorithmic solutions would benefit highly repetitive image processing procedures. One of the most complex tasks in this sense is image registration, which finds the optimal geometric alignment among 3D image stacks and is widely employed in healthcare and robotics. Given the high computational demand of such a procedure, hardware accelerators are promising real-time and energy-efficient solutions, but they are complex to design and integrate within software pipelines. Therefore, this work presents an automation framework called Hephaestus that generates efficient 3D image registration pipelines combined with reconfigurable accelerators. Moreover, to alleviate the burden from the software, we codesign software-programmable accelerators that can adapt at run-time to the image volume dimensions. Hephaestus features a cross-platform abstraction layer that enables transparently high-performance and embedded systems deployment. However, given the computational complexity of 3D image registration, the embedded devices become a relevant and complex setting being constrained in memory; thus, they require further attention and tailoring of the accelerators and registration application to reach satisfactory results. Therefore, with Hephaestus , we also propose an approximation mechanism that enables such devices to perform the 3D image registration and even achieve, in some cases, the accuracy of the high-performance ones. Overall, Hephaestus demonstrates 1.85× of maximum speedup, 2.35× of efficiency improvement with respect to the State of the Art, a maximum speedup of 2.51× and 2.76× efficiency improvements against our software, while attaining state-of-the-art accuracy on 3D registrations.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks","authors":"Yunjie Pan, Jiecao Yu, Andrew Lukefahr, Reetuparna Das, Scott Mahlke","doi":"10.1145/3609093","DOIUrl":"https://doi.org/10.1145/3609093","url":null,"abstract":"Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a wide range of machine learning tasks. However, the high accuracy usually comes at the cost of substantial computation and energy consumption, making it difficult to be deployed on mobile and embedded devices. In CNNs, the compute-intensive convolutional layers are usually followed by a ReLU activation layer, which clamps negative outputs to zeros, resulting in large activation sparsity. By exploiting such sparsity in CNN models, we propose a software-hardware co-design BitSET, that aggressively saves energy during CNN inference. The bit-serial BitSET accelerator adopts a prediction-based bit-level early termination technique that terminates the ineffectual computation of negative outputs early. To assist the algorithm, we propose a novel weight encoding that allows more accurate predictions with fewer bits. BitSET leverages the bit-level computation reduction both in the predictive early termination algorithm and in the non-predictive, energy-efficient bit-serial architecture. Compared to UNPU, an energy-efficient bit-serial CNN accelerator, BitSET yields an average 1.5× speedup and 1.4× energy efficiency improvement with no accuracy loss due to a 48% reduction in bit-level computations. Relaxing the allowed accuracy loss to 1% increases the gains to an average of 1.6× speedup and 1.4× energy efficiency improvement.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Constructive State-based Semantics and Interpreter for a Synchronous Data-flow Language with State Machines","authors":"Jean-Louis Colaço, Michael Mendler, Baptiste Pauget, Marc Pouzet","doi":"10.1145/3609131","DOIUrl":"https://doi.org/10.1145/3609131","url":null,"abstract":"Scade is a domain-specific synchronous functional language used to implement safety-critical real-time software for more than twenty years. Two main approaches have been considered for its semantics: (i) an indirect collapsing semantics based on a source-to-source translation of high-level constructs into a data-flow core language whose semantics is precisely specified and is the entry for code generation; a relational synchronous semantics , akin to Esterel, that applies directly to the source. It defines what is a valid synchronous reaction but hides, on purpose, if a semantics exists, is unique and can be computed; hence, it is not executable. This paper presents, for the first time, an executable , state-based semantics for a language that has the key constructs of Scade all together, in particular the arbitrary combination of data-flow equations and hierarchical state machines. It can apply directly to the source language before static checks and compilation steps. It is constructive in the sense that the language in which the semantics is defined is a statically typed functional language with call-by-value and strong normalization, e.g., it is expressible in a proof-assistant where all functions terminate. It leads to a reference, purely functional, interpreter. This semantics is modular and can account for possible errors, allowing to establish what property is ensured by each static verification performed by the compiler. It also clarifies how causality is treated in Scade compared with Esterel. This semantics can serve as an oracle for compiler testing and validation; to prototype novel language constructs before they are implemented, to execute possibly unfinished models or that are correct but rejected by the compiler; to prove the correctness of compilation steps. The semantics given in the paper is implemented as an interpreter in a purely functional style, in OCaml.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"<scp>ObNoCs</scp> : Protecting Network-on-Chip Fabrics Against Reverse-Engineering Attacks","authors":"Dipal Halder, Maneesh Merugu, Sandip Ray","doi":"10.1145/3609107","DOIUrl":"https://doi.org/10.1145/3609107","url":null,"abstract":"Modern System-on-Chip designs typically use Network-on-Chip (NoC) fabrics to implement coordination among integrated hardware blocks. An important class of security vulnerabilities involves a rogue foundry reverse-engineering the NoC topology and routing logic. In this paper, we develop an infrastructure, ObNoCs , for protecting NoC fabrics against such attacks. ObNoCs systematically replaces router connections with switches that can be programmed after fabrication to induce the desired topology. Our approach provides provable redaction of NoC functionality: switch configurations induce a large number of legal topologies, only one of which corresponds to the intended topology. We implement the ObNoCs methodology on Intel Quartus™ Platform, and experimental results on realistic SoC designs show that the architecture incurs minimal overhead in power, resource utilization, and system latency.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proactive Stripe Reconstruction to Improve Cache Use Efficiency of SSD-Based RAID Systems","authors":"Zhibing Sha, Jiaojiao Wu, Jun Li, Balazs Gerofi, Zhigang Cai, Jianwei Liao","doi":"10.1145/3609099","DOIUrl":"https://doi.org/10.1145/3609099","url":null,"abstract":"Solid-State Drives (SSDs) exhibit different failure characteristics compared to conventional hard disk drives. In particular, the Bit Error Rate (BER) of an SSD increases as it bears more writes. Then, Parity-based Redundant Array of Inexpensive Disks (RAID) arrays composed from SSDs are introduced to address correlated failures. In the RAID-5 implementation, specifically, the process of parity generation (or update) associating with a data stripe, consists of read and write operations to the SSDs. Whenever a new update request comes to the RAID system, the related parity must be also updated and flushed onto the RAID component of SSD. Such frequent parity updates result in poor RAID performance and shorten the life-time of the SSDs. Consequently, a DRAM cache is commonly equipped accompanying with the RAID controller, called the parity cache, and used to buffer the parity chunks that are most frequently updated data, for boosting I/O performance. To better improve the use efficiency of the parity cache, this paper proposes a stripe reconstruction approach to minimize the number of parity updates on SSDs, thus boosting I/O performance of the SSD RAID system. When the currently updated stripe has both cold and hot updated data chunks, it will proactively carry out stripe reconstruction if we can find another matched stripe that also includes cold and hot update data chunks on the complementary RAID components. In the reconstruction process, we first group the cold data chunks of two matched stripes, to build a new stripe and flush the parity chunk on the RAID component. After that, the hot data chunks are organized as a new stripe as well, and its parity chunk is buffered in the parity cache. This results in better cache use efficiency, as it can reduce the number of parity updates on RAID components of SSDs, as well as proactively free up cache space for quickly absorbing subsequent write requests. In addition, the proposed method adjusts the target SSD of write requests based on stripe reconstructions through considering the I/O workload balance of all SSDs. Experimental results show that our proposal can reduce the number of parity chunk updates in SSDs by 2.3% and overall I/O latency by 12.2% on average, compared to state-of-the-art parity cache management techniques.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Methods to Realize Preemption in Phased Execution Models","authors":"Thilanka Thilakasiri, Matthias Becker","doi":"10.1145/3609132","DOIUrl":"https://doi.org/10.1145/3609132","url":null,"abstract":"Phased execution models are a good solution to tame the increased complexity and contention of commercial off-the-shelf (COTS) multi-core platforms, e.g., Acquisition-Execution-Restitution (AER) model, PRedictable Execution Model (PREM). Such models separate execution from access to shared resources on the platform to minimize contention. All data and instructions needed during an execution phase are copied into the local memory of the core before starting to execute. Phased execution models are generally used with non-preemptive scheduling to increase predictability. However, the blocking time in non-preemptive systems can reduce schedulability. Therefore, an investigation of preemption methods for phased execution models is warranted. Although, preemption for phased execution models must be carefully designed to retain its execution semantics, i.e., the handling of local memory during preemption becomes non-trivial. This paper investigates different methods to realize preemption in phased execution models while preserving their semantics. To the best of our knowledge, this is the first paper to explore different approaches to implement preemption in phased execution models from the perspective of data management. We introduce two strategies to realize preemption of execution phases based on different methods of handling local data of the preempted task. Heuristics are used to create time-triggered schedules for task sets that follow the proposed preemption methods. Additionally, a schedulability-aware preemption heuristic is proposed to reduce the number of preemptions by allowing preemption only when it is beneficial in terms of schedulability. Evaluations on a large number of synthetic task sets are performed to compare the proposed preemption models against each other and against a non-preemptive version. Furthermore, our schedulability-aware preemption heuristic has higher schedulability with a clear margin in all our experiments compared to the non-preemptive and fully-preemptive versions.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedHIL: Heterogeneity Resilient Federated Learning for Robust Indoor Localization with Mobile Devices","authors":"Danish Gufran, Sudeep Pasricha","doi":"10.1145/3607919","DOIUrl":"https://doi.org/10.1145/3607919","url":null,"abstract":"Indoor localization plays a vital role in applications such as emergency response, warehouse management, and augmented reality experiences. By deploying machine learning (ML) based indoor localization frameworks on their mobile devices, users can localize themselves in a variety of indoor and subterranean environments. However, achieving accurate indoor localization can be challenging due to heterogeneity in the hardware and software stacks of mobile devices, which can result in inconsistent and inaccurate location estimates. Traditional ML models also heavily rely on initial training data, making them vulnerable to degradation in performance with dynamic changes across indoor environments. To address the challenges due to device heterogeneity and lack of adaptivity, we propose a novel embedded ML framework called FedHIL . Our framework combines indoor localization and federated learning (FL) to improve indoor localization accuracy in device-heterogeneous environments while also preserving user data privacy. FedHIL integrates a domain-specific selective weight adjustment approach to preserve the ML model's performance for indoor localization during FL, even in the presence of extremely noisy data. Experimental evaluations in diverse real-world indoor environments and with heterogeneous mobile devices show that FedHIL outperforms state-of-the-art FL and non-FL indoor localization frameworks. FedHIL is able to achieve 1.62 × better localization accuracy on average than the best performing FL-based indoor localization framework from prior work.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"B-AWARE: Blockage Aware RSU Scheduling for 5G Enabled Autonomous Vehicles","authors":"Matthew Szeto, Edward Andert, Aviral Shrivastava, Martin Reisslein, Chung-Wei Lin, Christ Richmond","doi":"10.1145/3609133","DOIUrl":"https://doi.org/10.1145/3609133","url":null,"abstract":"5G Millimeter Wave (mmWave) technology holds great promise for Connected Autonomous Vehicles (CAVs) due to its ability to achieve data rates in the Gbps range. However, mmWave suffers from a high beamforming overhead and requirement of line of sight (LOS) to maintain a strong connection. For Vehicle-to-Infrastructure (V2I) scenarios, where CAVs connect to roadside units (RSUs), these drawbacks become apparent. Because vehicles are dynamic, there is a large potential for link blockages. These blockages are detrimental to the connected applications running on the vehicle, such as cooperative perception and remote driver takeover. Existing RSU selection schemes base their decisions on signal strength and vehicle trajectory alone, which is not enough to prevent the blockage of links. Many modern CAVs motion planning algorithms routinely use other vehicle’s near-future path plans, either by explicit communication among vehicles, or by prediction. In this paper, we make use of the knowledge of other vehicle’s near future path plans to further improve the RSU association mechanism for CAVs. We solve the RSU association algorithm by converting it to a shortest path problem with the objective to maximize the total communication bandwidth. We evaluate our approach, titled B-AWARE, in simulation using Simulation of Urban Mobility (SUMO) and Digital twin for self-dRiving Intelligent VEhicles (DRIVE) on 12 highway and city street scenarios with varying traffic density and RSU placements. Simulations show B-AWARE results in a 1.05× improvement of the potential datarate in the average case and 1.28× in the best case vs. the state-of-the-art. But more impressively, B-AWARE reduces the time spent with no connection by 42% in the average case and 60% in the best case as compared to the state-of-the-art methods. This is a result of B-AWARE reducing nearly 100% of blockage occurrences.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}