Kourosh Vali, Ata Vafi, Begum Kasap, Soheil Ghiasi
{"title":"BASS: Safe Deep Tissue Optical Sensing for Wearable Embedded Systems","authors":"Kourosh Vali, Ata Vafi, Begum Kasap, Soheil Ghiasi","doi":"10.1145/3607916","DOIUrl":"https://doi.org/10.1145/3607916","url":null,"abstract":"In wearable optical sensing applications whose target tissue is not superficial, such as deep tissue oximetry, the task of embedded system design has to strike a balance between two competing factors. On one hand, the sensing task is assisted by increasing the radiated energy into the body, which in turn, improves the signal-to-noise ratio (SNR) of the deep tissue at the sensor. On the other hand, patient safety consideration imposes a constraint on the amount of radiated energy into the body. In this paper, we study the trade-offs between the two factors by exploring the design space of the light source activation pulse. Furthermore, we propose BASS, an algorithm that leverages the activation pulse design space exploration, which further optimizes deep tissue SNR via spectral averaging, while ensuring the radiated energy into the body meets a safe upper bound. The effectiveness of the proposed technique is demonstrated via analytical derivations, simulations, and in vivo measurements in both pregnant sheep models and human subjects.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yitu Wang, Shiyu Li, Qilin Zheng, Andrew Chang, Hai Li, Yiran Chen
{"title":"<scp>EMS-i</scp> : An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference","authors":"Yitu Wang, Shiyu Li, Qilin Zheng, Andrew Chang, Hai Li, Yiran Chen","doi":"10.1145/3609384","DOIUrl":"https://doi.org/10.1145/3609384","url":null,"abstract":"Recommendation systems have been widely embedded into many Internet services. For example, Meta’s deep learning recommendation model (DLRM) shows high prefictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel of the DLRM dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. To address the issues aforementioned, we propose EMS-i , an efficient memory system design that integrates Solide State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we delicately design the inference kernel and develop a customized mapping scheme for SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and the performance comparable to RecNMP with 72% energy savings. EMS-i also saves up to 8.7× and 6.6 × memory cost w.r.t. RecSSD and RecNMP, respectively.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"695 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rectifying Skewed Kernel Page Reclamation in Mobile Devices for Improving User-Perceivable Latency","authors":"Yi-Quan Chou, Lin-Wei Shen, Li-Pin Chang","doi":"10.1145/3607937","DOIUrl":"https://doi.org/10.1145/3607937","url":null,"abstract":"A crucial design factor for users of smart mobile devices is the latency of graphical interface interaction. Switching a background app to foreground is a frequent operation on mobile devices and the latency of this process is highly perceivable to users. Based on an Android smartphone, through analysis of memory reference generated during the app-switching process, we observe that file (virtual) pages and anonymous pages are both heavily involved. However, to our surprise, the amounts of the two types of pages in the main memory are highly imbalanced, and frequent I/O operations on file pages noticeably slows down the app-switching process. In this study, we advocate to improve the app-switching latency by rectifying the skewed kernel page reclaiming. Our approach involves two parts: proactive identification of unused anonymous pages and adaptive balance between file pages and anonymous pages. As mobile apps are found inflating their anonymous pages, we propose identifying unused anonymous pages in sync with the app-switching events. In addition, Android devices replaces the swap device with RAM-based zram, and swapping on zram is much faster than file accessing on flash storage. Without causing thrashing, we propose swapping out as many anonymous pages to zram as possible for caching more file pages. We conduct experiments on a Google Pixel phone with realistic user workloads, and results confirm that our method is adaptive to different memory requirements and greatly improves the app-switching latency by up to 43% compared with the original kernel.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZPP: A Dynamic Technique to Eliminate Cache Pollution in NoC based MPSoCs","authors":"Dipika Deb, John Jose","doi":"10.1145/3609113","DOIUrl":"https://doi.org/10.1145/3609113","url":null,"abstract":"Data prefetching efficiently reduces the memory access latency in NUCA architectures as the Last Level Cache (LLC) is shared and distributed across multiple cores. But cache pollution generated by prefetcher reduces its efficiency by causing contention for shared resources such as LLC and the underlying network. The paper proposes Zero Pollution Prefetcher (ZPP) that eliminates cache pollution for NUCA architecture. For this purpose, ZPP uses L1 prefetcher and places the prefetched blocks in the data locations of LLC where modified blocks are stored. Since modified blocks in LLC are stale and request for such blocks are served from the exclusively owned private cache, their space unnecessary consumes power to maintain such stale data in the cache. The benefits of ZPP are (a) Eliminates cache pollution in L1 and LLC by storing prefetched blocks in LLC locations where stale blocks are stored. (b) Insufficient cache space is solved by placing prefetched blocks in LLC as LLCs are larger in size than L1 cache. This helps in prefetching more cache blocks, thereby increasing prefetch aggressiveness. (c) Increasing prefetch aggressiveness increases its coverage. (d) It also maintains an equivalent lookup latency to L1 cache for prefetched blocks. Experimentally it has been found that ZPP increases weighted speedup by 2.19x as compared to a system with no prefetching while prefetch coverage and prefetch accuracy increases by 50%, and 12%, respectively compared to the baseline.1","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Verified Compilation of Synchronous Dataflow with State Machines","authors":"Timothy Bourke, Basile Pesin, Marc Pouzet","doi":"10.1145/3608102","DOIUrl":"https://doi.org/10.1145/3608102","url":null,"abstract":"Safety-critical embedded software is routinely programmed in block-diagram languages. Recent work in the Vélus project specifies such a language and its compiler in the Coq proof assistant. It builds on the CompCert verified C compiler to give an end-to-end proof linking the dataflow semantics of source programs to traces of the generated assembly code. We extend this work with switched blocks, shared variables, reset blocks, and state machines; define a relational semantics to integrate these block- and mode-based constructions into the existing stream-based model; adapt the standard source-to-source rewriting scheme to compile the new constructions; and reestablish the correctness theorem.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks","authors":"Harsh Sharma, Lukas Pfromm, Rasit Onur Topaloglu, Janardhan Rao Doppa, Umit Y. Ogras, Ananth Kalyanraman, Partha Pratim Pande","doi":"10.1145/3608098","DOIUrl":"https://doi.org/10.1145/3608098","url":null,"abstract":"Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications including machine learning. Network-on-Interposer (NoI) enables integration of multiple chiplets on a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures have a limited computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data flow pattern, and optimizes the inter-chiplet data exchange to extract high performance for multiple types of convolutional neural network (CNN) inference tasks running concurrently. We demonstrate that the Floret architecture reduces the latency and energy up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple CNN tasks simultaneously. Floret achieves high performance and significant energy savings with much lower fabrication cost by exploiting the data-flow awareness of the CNN inference tasks.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flavio Ponzina, Marco Rios, Alexandre Levisse, Giovanni Ansaloni, David Atienza
{"title":"Overflow-free Compute Memories for Edge AI Acceleration","authors":"Flavio Ponzina, Marco Rios, Alexandre Levisse, Giovanni Ansaloni, David Atienza","doi":"10.1145/3609387","DOIUrl":"https://doi.org/10.1145/3609387","url":null,"abstract":"Compute memories are memory arrays augmented with dedicated logic to support arithmetic. They support the efficient execution of data-centric computing patterns, such as those characterizing Artificial Intelligence (AI) algorithms. These architectures can provide computing capabilities as part of the memory array structures (In-Memory Computing, IMC) or at their immediate periphery (Near-Memory Computing, NMC). By bringing the processing elements inside (or very close to) storage, compute memories minimize the cost of data access. Moreover, highly parallel (and, hence, high-performance) computations are enabled by exploiting the regular structure of memory arrays. However, the regular layout of memory elements also constrains the data range of inputs and outputs, since the bitwidths of operands and results stored at each address cannot be freely varied. Addressing this challenge, we herein propose a HW/SW co-design methodology combining careful per-layer quantization and inter-layer scaling with lightweight hardware support for overflow-free computation of dot-vector operations. We demonstrate their use to implement the convolutional and fully connected layers of AI models. We embody our strategy in two implementations, based on IMC and NMC, respectively. Experimental results highlight that an area overhead of only 10.5% (for IMC) and 12.9% (for NMC) is required when interfacing with a 2KB subarray. Furthermore, inferences on benchmark CNNs show negligible accuracy degradation due to quantization for equivalent floating-point implementations.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Edward A. Lee, Ravi Akella, Soroush Bateni, Shaokai Lin, Marten Lohstroh, Christian Menard
{"title":"Consistency vs. Availability in Distributed Cyber-Physical Systems","authors":"Edward A. Lee, Ravi Akella, Soroush Bateni, Shaokai Lin, Marten Lohstroh, Christian Menard","doi":"10.1145/3609119","DOIUrl":"https://doi.org/10.1145/3609119","url":null,"abstract":"In distributed applications, Brewer’s CAP theorem tells us that when networks become partitioned (P), one must give up either consistency (C) or availability (A). Consistency is agreement on the values of shared variables; availability is the ability to respond to reads and writes accessing those shared variables. Availability is a real-time property whereas consistency is a logical property. We extend consistency and availability to refer to cyber-physical properties such as the state of the physical system and delays in actuation. We have further extended the CAP theorem to relate quantitative measures of these two properties to quantitative measures of communication and computation latency (L), obtaining a relation called the CAL theorem that is linear in a max-plus algebra. This paper shows how to use the CAL theorem in various ways to help design cyber-physical systems. We develop a methodology for systematically trading off availability and consistency in application-specific ways and to guide the system designer when putting functionality in end devices, in edge computers, or in the cloud. We build on the Lingua Franca coordination language to provide system designers with concrete analysis and design tools to make the required tradeoffs in deployable embedded software.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136191886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikhilesh Singh, Karthikeyan Renganathan, Chester Rebeiro, Jithin Jose, Ralph Mader
{"title":"Kryptonite: Worst-Case Program Interference Estimation on Multi-Core Embedded Systems","authors":"Nikhilesh Singh, Karthikeyan Renganathan, Chester Rebeiro, Jithin Jose, Ralph Mader","doi":"10.1145/3609128","DOIUrl":"https://doi.org/10.1145/3609128","url":null,"abstract":"Due to the low costs and energy needed, cyber-physical systems are adopting multi-core processors for their embedded computing requirements. In order to guarantee safety when the application has real-time constraints, a critical requirement is to estimate the worst-case interference from other executing programs. However, the complexity of multi-core hardware inhibits precisely determining the Worst-Case Program Interference. Existing solutions are either prone to overestimate the interference or are not scalable to different hardware sizes and designs. In this paper we present Kryptonite , an automated framework to synthesize Worst-Case Program Interference (WCPI) environments for multi-core systems. Fundamental to Kryptonite is a set of tiny hardware-specific code gadgets that are crafted to maximize interference locally. The gadgets are arranged using a greedy approach and then molded using a Reinforcement Learning algorithm to create the WCPI environment. We demonstrate Kryptonite on the automotive grade Infineon AURIX TC399 processor with a wide range of programs that includes a commercial real-time automotive application. We show that, while being easily scalable and tunable, Kryptonite creates WCPI environments increasing the runtime by up to 58% for benchmark applications and 26% for the automotive application.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FSIMR: File-system-aware Data Management for Interlaced Magnetic Recording","authors":"Yi-Han Lien, Yen-Ting Chen, Yuan-Hao Chang, Yu-Pei Liang, Wei-Kuan Shih","doi":"10.1145/3607922","DOIUrl":"https://doi.org/10.1145/3607922","url":null,"abstract":"Interlaced Magnetic Recording (IMR) is an emerging recording technology for hard-disk drives (HDDs) that provides larger storage capacity at a lower cost. By partially overlapping (interlacing) each bottom track with two adjacent top tracks, IMR-based HDDs successfully increase the data density while incurring some hardware write constraints. To update each bottom track, the data on two adjacent top tracks must be read and rewritten to avoid losing their valid data, resulting in additional overhead for performing read-modify-write (RMW) operations. Therefore, researchers have proposed various data management schemes to mitigate such overhead in recent years, aiming at improving the write performance. However, these designs have not taken into account the data characteristics of the file system, which is a crucial layer of operating systems for storing/retrieving data into/from HDDs. Consequently, the write performance improvement is limited due to the unawareness of spatial locality and hotness of data. This paper proposes a file-system-aware data management scheme called FSIMR to improve system write performance. Noticing that data of the same directory may have higher spatial locality and are mostly updated at the same time, FSIMR logically partitions the IMR-based HDD into fixed-sized zones; data belonging to the same directory will be arranged to one zone to reduce the time of seeking to-be-updated data (seek time). Furthermore, cold data within a zone are arranged to bottom tracks and updated in an out-of-place manner to eliminate RMW operations. Our experimental results show that the proposed FSIMR could reduce the seek time by up to 14% without introducing additional RMW operations, compared to existing designs.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}