Mario Günzel, Niklas Ueter, Kuan-Hsun Chen, Georg von der Brüggen, Jian-Jia Chen
{"title":"Probabilistic Reaction Time Analysis","authors":"Mario Günzel, Niklas Ueter, Kuan-Hsun Chen, Georg von der Brüggen, Jian-Jia Chen","doi":"10.1145/3609390","DOIUrl":"https://doi.org/10.1145/3609390","url":null,"abstract":"In many embedded systems, for instance, in the automotive, avionic, or robotics domain, critical functionalities are implemented via chains of communicating recurrent tasks. To ensure safety and correctness of such systems, guarantees on the reaction time, that is, the delay between a cause (e.g., an external activity or reading of a sensor) and the corresponding effect, must be provided. Current approaches focus on the maximum reaction time, considering the worst-case system behavior. However, in many scenarios, probabilistic guarantees on the reaction time are sufficient. That is, it is sufficient to provide a guarantee that the reaction time does not exceed a certain threshold with (at least) a certain probability. This work provides such probabilistic guarantees on the reaction time, considering two types of randomness: response time randomness and failure probabilities. To the best of our knowledge, this is the first work that defines and analyzes probabilistic reaction time for cause-effect chains based on sporadic tasks.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kourosh Vali, Ata Vafi, Begum Kasap, Soheil Ghiasi
{"title":"BASS: Safe Deep Tissue Optical Sensing for Wearable Embedded Systems","authors":"Kourosh Vali, Ata Vafi, Begum Kasap, Soheil Ghiasi","doi":"10.1145/3607916","DOIUrl":"https://doi.org/10.1145/3607916","url":null,"abstract":"In wearable optical sensing applications whose target tissue is not superficial, such as deep tissue oximetry, the task of embedded system design has to strike a balance between two competing factors. On the one hand, the sensing task is assisted by increasing the radiated energy into the body, which, in turn, improves the signal-to-noise ratio (SNR) of the deep tissue at the sensor. On the other hand, patient safety considerations impose a constraint on the amount of radiated energy into the body. In this paper, we study the trade-offs between the two factors by exploring the design space of the light source activation pulse. Furthermore, we propose BASS, an algorithm that leverages the activation pulse design space exploration, which further optimizes deep tissue SNR via spectral averaging, while ensuring the radiated energy into the body meets a safe upper bound. The effectiveness of the proposed technique is demonstrated via analytical derivations, simulations, and in vivo measurements in both pregnant sheep models and human subjects.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yitu Wang, Shiyu Li, Qilin Zheng, Andrew Chang, Hai Li, Yiran Chen
{"title":"EMS-i: An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference","authors":"Yitu Wang, Shiyu Li, Qilin Zheng, Andrew Chang, Hai Li, Yiran Chen","doi":"10.1145/3609384","DOIUrl":"https://doi.org/10.1145/3609384","url":null,"abstract":"Recommendation systems have been widely embedded into many Internet services. For example, Meta’s deep learning recommendation model (DLRM) shows high predictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel of the DLRM dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces a low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. To address the aforementioned issues, we propose EMS-i, an efficient memory system design that integrates a Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we carefully design the inference kernel and develop a customized mapping scheme for the SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and performance comparable to RecNMP with 72% energy savings. EMS-i also saves up to 8.7× and 6.6× memory cost w.r.t. RecSSD and RecNMP, respectively.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Debasmita Lohar, Clothilde Jeangoudoux, Anastasia Volkova, Eva Darulova
{"title":"Sound Mixed Fixed-Point Quantization of Neural Networks","authors":"Debasmita Lohar, Clothilde Jeangoudoux, Anastasia Volkova, Eva Darulova","doi":"10.1145/3609118","DOIUrl":"https://doi.org/10.1145/3609118","url":null,"abstract":"Neural networks are increasingly being used as components in safety-critical applications, for instance, as controllers in embedded systems. Their formal safety verification has made significant progress but typically considers only idealized real-valued networks. For practical applications, such neural networks have to be quantized, i.e., implemented in finite-precision arithmetic, which inevitably introduces roundoff errors. Choosing a suitable precision that is both guaranteed to satisfy a roundoff error bound to ensure safety and as small as possible to avoid wasting resources is highly nontrivial to do manually. This task is especially challenging when quantizing a neural network in fixed-point arithmetic, where one can choose among a large number of precisions and has to ensure overflow-freedom explicitly. This paper presents the first sound and fully automated mixed-precision quantization approach that specifically targets deep feed-forward neural networks. Our quantization is based on mixed-integer linear programming (MILP) and leverages the unique structure of neural networks and effective over-approximations to make MILP optimization feasible. Our approach efficiently optimizes the number of bits needed to implement a network while guaranteeing a provided error bound. Our evaluation on existing embedded neural controller benchmarks shows that our optimization translates into precision assignments that mostly use fewer machine cycles, when compiled to an FPGA with a commercial HLS compiler, than code generated by the (sound) state of the art. Furthermore, our approach handles significantly more benchmarks substantially faster, especially for larger networks.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic Black-Box Checking via Active MDP Learning","authors":"Junya Shijubo, Masaki Waga, Kohei Suenaga","doi":"10.1145/3609127","DOIUrl":"https://doi.org/10.1145/3609127","url":null,"abstract":"We introduce a novel methodology for testing stochastic black-box systems, which are frequently encountered in embedded systems. Our approach enhances the established black-box checking (BBC) technique to address stochastic behavior. Traditional BBC primarily involves iteratively identifying an input that breaches the system’s specifications by executing the following three phases: the learning phase to construct an automaton approximating the black box’s behavior, the synthesis phase to identify a candidate counterexample from the learned automaton, and the validation phase to validate the obtained candidate counterexample and the learned automaton against the original black-box system. Our method, ProbBBC, refines the conventional BBC approach by (1) employing an active Markov Decision Process (MDP) learning method during the learning phase, (2) incorporating probabilistic model checking in the synthesis phase, and (3) applying statistical hypothesis testing in the validation phase. ProbBBC uniquely integrates these techniques rather than merely substituting each method in the traditional BBC; for instance, the statistical hypothesis testing and the MDP learning procedure exchange information regarding the black-box system’s observations with one another. The experimental results suggest that ProbBBC outperforms an existing method, especially for systems with limited observation.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Verified Compilation of Synchronous Dataflow with State Machines","authors":"Timothy Bourke, Basile Pesin, Marc Pouzet","doi":"10.1145/3608102","DOIUrl":"https://doi.org/10.1145/3608102","url":null,"abstract":"Safety-critical embedded software is routinely programmed in block-diagram languages. Recent work in the Vélus project specifies such a language and its compiler in the Coq proof assistant. It builds on the CompCert verified C compiler to give an end-to-end proof linking the dataflow semantics of source programs to traces of the generated assembly code. We extend this work with switched blocks, shared variables, reset blocks, and state machines; define a relational semantics to integrate these block- and mode-based constructions into the existing stream-based model; adapt the standard source-to-source rewriting scheme to compile the new constructions; and reestablish the correctness theorem.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rectifying Skewed Kernel Page Reclamation in Mobile Devices for Improving User-Perceivable Latency","authors":"Yi-Quan Chou, Lin-Wei Shen, Li-Pin Chang","doi":"10.1145/3607937","DOIUrl":"https://doi.org/10.1145/3607937","url":null,"abstract":"A crucial design factor for users of smart mobile devices is the latency of graphical interface interaction. Switching a background app to the foreground is a frequent operation on mobile devices, and the latency of this process is highly perceivable to users. Through analysis of the memory references generated during the app-switching process on an Android smartphone, we observe that file (virtual) pages and anonymous pages are both heavily involved. However, to our surprise, the amounts of the two types of pages in the main memory are highly imbalanced, and frequent I/O operations on file pages noticeably slow down the app-switching process. In this study, we advocate improving the app-switching latency by rectifying the skewed kernel page reclaiming. Our approach involves two parts: proactive identification of unused anonymous pages and adaptive balance between file pages and anonymous pages. As mobile apps are found to inflate their anonymous pages, we propose identifying unused anonymous pages in sync with the app-switching events. In addition, Android devices replace the swap device with RAM-based zram, and swapping on zram is much faster than file access on flash storage. Without causing thrashing, we propose swapping out as many anonymous pages to zram as possible to cache more file pages. We conduct experiments on a Google Pixel phone with realistic user workloads, and the results confirm that our method is adaptive to different memory requirements and greatly improves the app-switching latency, by up to 43% compared with the original kernel.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZPP: A Dynamic Technique to Eliminate Cache Pollution in NoC based MPSoCs","authors":"Dipika Deb, John Jose","doi":"10.1145/3609113","DOIUrl":"https://doi.org/10.1145/3609113","url":null,"abstract":"Data prefetching efficiently reduces the memory access latency in NUCA architectures, as the Last Level Cache (LLC) is shared and distributed across multiple cores. However, cache pollution generated by the prefetcher reduces its efficiency by causing contention for shared resources such as the LLC and the underlying network. The paper proposes the Zero Pollution Prefetcher (ZPP), which eliminates cache pollution for NUCA architectures. For this purpose, ZPP uses an L1 prefetcher and places the prefetched blocks in the data locations of the LLC where modified blocks are stored. Since modified blocks in the LLC are stale and requests for such blocks are served from the exclusively owned private cache, their space unnecessarily consumes power to maintain such stale data in the cache. The benefits of ZPP are: (a) it eliminates cache pollution in L1 and the LLC by storing prefetched blocks in LLC locations where stale blocks are stored; (b) it addresses insufficient cache space by placing prefetched blocks in the LLC, which is larger than the L1 cache, allowing more cache blocks to be prefetched and thereby increasing prefetch aggressiveness; (c) increased prefetch aggressiveness increases prefetch coverage; and (d) it maintains a lookup latency for prefetched blocks equivalent to that of the L1 cache. Experimentally, ZPP increases weighted speedup by 2.19x compared to a system with no prefetching, while prefetch coverage and prefetch accuracy increase by 50% and 12%, respectively, compared to the baseline.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Edward A. Lee, Ravi Akella, Soroush Bateni, Shaokai Lin, Marten Lohstroh, Christian Menard
{"title":"Consistency vs. Availability in Distributed Cyber-Physical Systems","authors":"Edward A. Lee, Ravi Akella, Soroush Bateni, Shaokai Lin, Marten Lohstroh, Christian Menard","doi":"10.1145/3609119","DOIUrl":"https://doi.org/10.1145/3609119","url":null,"abstract":"In distributed applications, Brewer’s CAP theorem tells us that when networks become partitioned (P), one must give up either consistency (C) or availability (A). Consistency is agreement on the values of shared variables; availability is the ability to respond to reads and writes accessing those shared variables. Availability is a real-time property whereas consistency is a logical property. We extend consistency and availability to refer to cyber-physical properties such as the state of the physical system and delays in actuation. We have further extended the CAP theorem to relate quantitative measures of these two properties to quantitative measures of communication and computation latency (L), obtaining a relation called the CAL theorem that is linear in a max-plus algebra. This paper shows how to use the CAL theorem in various ways to help design cyber-physical systems. We develop a methodology for systematically trading off availability and consistency in application-specific ways and for guiding the system designer when placing functionality in end devices, in edge computers, or in the cloud. We build on the Lingua Franca coordination language to provide system designers with concrete analysis and design tools to make the required tradeoffs in deployable embedded software.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136191886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikhilesh Singh, Karthikeyan Renganathan, Chester Rebeiro, Jithin Jose, Ralph Mader
{"title":"Kryptonite: Worst-Case Program Interference Estimation on Multi-Core Embedded Systems","authors":"Nikhilesh Singh, Karthikeyan Renganathan, Chester Rebeiro, Jithin Jose, Ralph Mader","doi":"10.1145/3609128","DOIUrl":"https://doi.org/10.1145/3609128","url":null,"abstract":"Due to their low cost and energy consumption, multi-core processors are being adopted by cyber-physical systems for their embedded computing requirements. To guarantee safety when the application has real-time constraints, a critical requirement is to estimate the worst-case interference from other executing programs. However, the complexity of multi-core hardware makes it difficult to precisely determine the Worst-Case Program Interference. Existing solutions either tend to overestimate the interference or do not scale to different hardware sizes and designs. In this paper, we present Kryptonite, an automated framework to synthesize Worst-Case Program Interference (WCPI) environments for multi-core systems. Fundamental to Kryptonite is a set of tiny hardware-specific code gadgets that are crafted to maximize interference locally. The gadgets are arranged using a greedy approach and then molded using a Reinforcement Learning algorithm to create the WCPI environment. We demonstrate Kryptonite on the automotive-grade Infineon AURIX TC399 processor with a wide range of programs, including a commercial real-time automotive application. We show that, while being easily scalable and tunable, Kryptonite creates WCPI environments that increase the runtime by up to 58% for benchmark applications and 26% for the automotive application.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}