{"title":"IOSR: Improving I/O Efficiency for Memory Swapping on Mobile Devices Via Scheduling and Reshaping","authors":"Wentong Li, Liang Shi, Hang Li, Changlong Li, Edwin Hsing-Mean Sha","doi":"10.1145/3607923","DOIUrl":"https://doi.org/10.1145/3607923","url":null,"abstract":"Mobile systems and applications are becoming increasingly feature-rich and powerful, which constantly suffer from memory pressure, especially for devices equipped with limited DRAM. Swapping inactive DRAM pages to the storage device is a promising solution to extend the physical memory. However, existing mobile devices usually adopt flash memory as the storage device, where swapping DRAM pages to flash memory may introduce significant performance overhead. In this paper, we first conduct an in-depth analysis of the I/O characteristics of the flash-based memory swapping, including the I/O interference and swap I/O randomness in swap subsystem. Then an I/O efficiency optimization framework for memory swapping (IOSR) is proposed to enhance the performance of flash-based memory swapping for mobile devices. IOSR consists of two methods: swap I/O scheduling (SIOS) and swap I/O pattern reshaping (SIOR). SIOS is designed to schedule the swap I/O to reduce interference with other processes I/Os. SIOR is designed to reshape the swap I/O pattern with process-oriented swap slot allocation and adaptive granularity swap read-ahead. IOSR is implemented on Google Pixel 4. Experimental results show that IOSR reduces the application switching time by 31.7% and improves the swap-in bandwidth by 35.5% on average compared to the state-of-the-art.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CIM: A Novel Clustering-based Energy-Efficient Data Imputation Method for Human Activity Recognition","authors":"Dina Hussein, Ganapati Bhat","doi":"10.1145/3609111","DOIUrl":"https://doi.org/10.1145/3609111","url":null,"abstract":"Human activity recognition (HAR) is an important component in a number of health applications, including rehabilitation, Parkinson’s disease, daily activity monitoring, and fitness monitoring. State-of-the-art HAR approaches use multiple sensors on the body to accurately identify activities at runtime. These approaches typically assume that data from all sensors are available for runtime activity recognition. However, data from one or more sensors may be unavailable due to malfunction, energy constraints, or communication challenges between the sensors. Missing data can lead to significant degradation in the accuracy, thus affecting quality of service to users. A common approach for handling missing data is to train classifiers or sensor data recovery algorithms for each combination of missing sensors. However, this results in significant memory and energy overhead on resource-constrained wearable devices. In strong contrast to prior approaches, this paper presents a clustering-based approach (CIM) to impute missing data at runtime. We first define a set of possible clusters and representative data patterns for each sensor in HAR. Then, we create and store a mapping between clusters across sensors. At runtime, when data from a sensor are missing, we utilize the stored mapping table to obtain most likely cluster for the missing sensor. The representative window for the identified cluster is then used as imputation to perform activity classification. We also provide a method to obtain imputation-aware activity prediction sets to handle uncertainty in data when using imputation. Experiments on three HAR datasets show that CIM achieves accuracy within 10% of a baseline without missing data for one missing sensor when providing single activity labels. The accuracy gap drops to less than 1% with imputation-aware classification. Measurements on a low-power processor show that CIM achieves close to 100% energy savings compared to state-of-the-art generative approaches.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MaGNAS: A Mapping-Aware Graph Neural Architecture Search Framework for Heterogeneous MPSoC Deployment","authors":"Mohanad Odema, Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Mohammad Abdullah Al Faruque","doi":"10.1145/3609386","DOIUrl":"https://doi.org/10.1145/3609386","url":null,"abstract":"Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity in modeling structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by the recent advancements in heterogeneous multi-processor Systems on Chips (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. Particularly, we develop MaGNAS, a mapping-aware Graph Neural Architecture Search framework. MaGNAS proposes a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify optimal GNNs and mapping pairings that yield the best performance trade-offs. Through designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four (04) state-of-the-art vision datasets using both ( i ) a real hardware SoC platform (NVIDIA Xavier AGX) and ( ii ) a performance/cost model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS is able to provide 1.57 × latency speedup and is 3.38 × more energy-efficient for several vision datasets executed on the Xavier MPSoC vs. the GPU-only deployment while sustaining an average 0.11% accuracy reduction from the baseline.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DTRL: Decision Tree-based Multi-Objective Reinforcement Learning for Runtime Task Scheduling in Domain-Specific System-on-Chips","authors":"Toygun Basaklar, A. Alper Goksoy, Anish Krishnakumar, Suat Gumussoy, Umit Y. Ogras","doi":"10.1145/3609108","DOIUrl":"https://doi.org/10.1145/3609108","url":null,"abstract":"Domain-specific systems-on-chip (DSSoCs) combine general-purpose processors and specialized hardware accelerators to improve performance and energy efficiency for a specific domain. The optimal allocation of tasks to processing elements (PEs) with minimal runtime overheads is crucial to achieving this potential. However, this problem remains challenging as prior approaches suffer from non-optimal scheduling decisions or significant runtime overheads. Moreover, existing techniques focus on a single optimization objective, such as maximizing performance. This work proposes DTRL, a decision-tree-based multi-objective reinforcement learning technique for runtime task scheduling in DSSoCs. DTRL trains a single global differentiable decision tree (DDT) policy that covers the entire objective space quantified by a preference vector. Our extensive experimental evaluations using our novel reinforcement learning environment demonstrate that DTRL captures the trade-off between execution time and power consumption, thereby generating a Pareto set of solutions using a single policy. Furthermore, comparison with state-of-the-art heuristic–, optimization–, and machine learning-based schedulers shows that DTRL achieves up to 9× higher performance and up to 3.08× reduction in energy consumption. The trained DDT policy achieves 120 ns inference latency on Xilinx Zynq ZCU102 FPGA at 1.2 GHz, resulting in negligible runtime overheads. Evaluation on the same hardware shows that DTRL achieves up to 16% higher performance than a state-of-the-art heuristic scheduler.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WARM-tree: Making Quadtrees Write-efficient and Space-economic on Persistent Memories","authors":"Shin-Ting Wu, Liang-Chi Chen, Po-Chun Huang, Yuan-Hao Chang, Chien-Chung Ho, Wei-Kuan Shih","doi":"10.1145/3608033","DOIUrl":"https://doi.org/10.1145/3608033","url":null,"abstract":"Recently, the value of data has been widely recognized, which highlights the significance of data-centric computing in diversified application scenarios. In many cases, the data are multidimensional, and the management of multidimensional data often confronts greater challenges in supporting efficient data access operations and guaranteeing the space utilization. On the other hand, while many existing index data structures have been proposed for multidimensional data management, however, their designs are not fully optimized for modern nonvolatile memories, in particular the byte-addressable persistent memories. As a result, they might undergo serious access performance degradation or fail to guarantee space utilization. This observation motivates the redesigning of index data structures for multidimensional point data on modern persistent memories, such as the phase-change memory. In this work, we present the WARM-tree , a m ultidimensional t ree for r educing the w rite a mplification effect, for multidimensional point data. In our evaluation studies, as compared to the bucket PR quadtree and R*-tree, the WARM-tree can provide any worst-case space utilization guarantees in the form of (frac{m-1}{m}) ( m ∈ ℤ^+) and effectively reduces the write traffic of key insertions by up to 48.10% and 85.86%, respectively, at the price of degraded average space utilization and prolonged latency of query operations. This suggests that the WARM-tree is a potential multidimensional index structure for insert-intensive workloads.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STADIA: Photonic Stochastic Gradient Descent for Neural Network Accelerators","authors":"Chengpeng Xia, Yawen Chen, Haibo Zhang, Jigang Wu","doi":"10.1145/3607920","DOIUrl":"https://doi.org/10.1145/3607920","url":null,"abstract":"Deep Neural Networks (DNNs) have demonstrated great success in many fields such as image recognition and text analysis. However, the ever-increasing sizes of both DNN models and training datasets make deep leaning extremely computation- and memory-intensive. Recently, photonic computing has emerged as a promising technology for accelerating DNNs. While the design of photonic accelerators for DNN inference and forward propagation of DNN training has been widely investigated, the architectural acceleration for equally important backpropagation of DNN training has not been well studied. In this paper, we propose a novel silicon photonic-based backpropagation accelerator for high performance DNN training. Specifically, a general-purpose photonic gradient descent unit named STADIA is designed to implement the multiplication, accumulation, and subtraction operations required for computing gradients using mature optical devices including Mach-Zehnder Interferometer (MZI) and Mircoring Resonator (MRR), which can significantly reduce the training latency and improve the energy efficiency of backpropagation. To demonstrate efficient parallel computing, we propose a STADIA-based backpropagation acceleration architecture and design a dataflow by using wavelength-division multiplexing (WDM). We analyze the precision of STADIA by quantifying the precision limitations imposed by losses and noises. Furthermore, we evaluate STADIA with different element sizes by analyzing the power, area and time delay for photonic accelerators based on DNN models such as AlexNet, VGG19 and ResNet. Simulation results show that the proposed architecture STADIA can achieve significant improvement by 9.7× in time efficiency and 147.2× in energy efficiency, compared with the most advanced optical-memristor based backpropagation accelerator.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic Analysis of Control Systems Subject to Communication and Computation Faults","authors":"Nils Vreman, Martina Maggio","doi":"10.1145/3609123","DOIUrl":"https://doi.org/10.1145/3609123","url":null,"abstract":"Control theory allows one to design controllers that are robust to external disturbances, model simplification, and modelling inaccuracy. Researchers have investigated whether the robustness carries on to the controller’s digital implementation, mostly looking at how the controller reacts to either communication or computational problems. Communication problems are typically modelled using random variables (i.e., estimating the probability that a fault will occur during a transmission), while computational problems are modelled using deterministic guarantees on the number of deadlines that the control task has to meet. These fault models allow the engineer to both design robust controllers and assess the controllers’ behaviour in the presence of isolated faults. Despite being very relevant for the real-world implementations of control system, the question of what happens when these faults occur simultaneously does not yet have a proper answer. In this paper, we answer this question in the stochastic setting, using the theory of Markov Jump Linear Systems to provide stability contracts with almost sure guarantees of convergence. For linear time-invariant Markov jump linear systems, mean square stability implies almost sure convergence – a property that is central to our investigation. Our research primarily emphasises the validation of this property for closed-loop systems that are subject to packet losses and computational overruns, potentially occurring simultaneously. We apply our method to two case studies from the recent literature and show their robustness to a comprehensive set of faults. We employ closed-loop system simulations to empirically derive performance metrics that elucidate the quality of the controller implementation, such as the system settling time and the integral absolute error.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iAware: Interaction Aware Task Scheduling for Reducing Resource Contention in Mobile Systems","authors":"Yongchun Zheng, Changlong Li, Yi Xiong, Weihong Liu, Cheng Ji, Zongwei Zhu, Lichen Yu","doi":"10.1145/3609391","DOIUrl":"https://doi.org/10.1145/3609391","url":null,"abstract":"To ensure the user experience of mobile systems, the foreground application can be differentiated to minimize the impact of background applications. However, this article observes that system services in the kernel and framework layer, instead of background applications, are now the major resource competitors. Specifically, these service tasks tend to be quiet when people rarely interact with the foreground application and active when interactions become frequent, and this high overlap of busy times leads to contention for resources. This article proposes iAware, an interaction-aware task scheduling framework in mobile systems. The key insight is to make use of the previously ignored idle period and schedule service tasks to run at that period. iAware quantify the interaction characteristic based on the screen touch event, and successfully stagger the periods of frequent user interactions. With iAware, service tasks tend to run when few interactions occur, for example, when the device’s screen is turned off, instead of when the user is frequently interacting with it. iAware is implemented on real smartphones. Experimental results show that the user experience is significantly improved with iAware. Compared to the state-of-the-art, the application launching speed and frame rate are enhanced by 38.89% and 7.97% separately, with no more than 1% additional battery consumption.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpikeHard: Efficiency-Driven Neuromorphic Hardware for Heterogeneous Systems-on-Chip","authors":"Judicael Clair, Guy Eichler, Luca P. Carloni","doi":"10.1145/3609101","DOIUrl":"https://doi.org/10.1145/3609101","url":null,"abstract":"Neuromorphic computing is an emerging field with the potential to offer performance and energy-efficiency gains over traditional machine learning approaches. Most neuromorphic hardware, however, has been designed with limited concerns to the problem of integrating it with other components in a heterogeneous System-on-Chip (SoC). Building on a state-of-the-art reconfigurable neuromorphic architecture, we present the design of a neuromorphic hardware accelerator equipped with a programmable interface that simplifies both the integration into an SoC and communication with the processor present on the SoC. To optimize the allocation of on-chip resources, we develop an optimizer to restructure existing neuromorphic models for a given hardware architecture, and perform design-space exploration to find highly efficient implementations. We conduct experiments with various FPGA-based prototypes of many-accelerator SoCs, where Linux-based applications running on a RISC-V processor invoke Pareto-optimal implementations of our accelerator alongside third-party accelerators. These experiments demonstrate that our neuromorphic hardware, which is up to 89× faster and 170× more energy efficient after applying our optimizer, can be used in synergy with other accelerators for different application purposes.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictable GPU Wavefront Splitting for Safety-Critical Systems","authors":"Artem Klashtorny, Zhuanhao Wu, Anirudh Mohan Kaushik, Hiren Patel","doi":"10.1145/3609102","DOIUrl":"https://doi.org/10.1145/3609102","url":null,"abstract":"We present a predictable wavefront splitting (PWS) technique for graphics processing units (GPUs). PWS improves the performance of GPU applications by reducing the impact of branch divergence while ensuring that worst-case execution time (WCET) estimates can be computed. This makes PWS an appropriate technique to use in safety-critical applications, such as autonomous driving systems, avionics, and space, that require strict temporal guarantees. In developing PWS on an AMD-based GPU, we propose microarchitectural enhancements to the GPU, and a compiler pass that eliminates branch serializations to reduce the WCET of a wavefront. Our analysis of PWS exhibits a performance improvement of 11% over existing architectures with a lower WCET than prior works in wavefront splitting.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}