{"title":"FSIMR: File-system-aware Data Management for Interlaced Magnetic Recording","authors":"Yi-Han Lien, Yen-Ting Chen, Yuan-Hao Chang, Yu-Pei Liang, Wei-Kuan Shih","doi":"10.1145/3607922","DOIUrl":"https://doi.org/10.1145/3607922","url":null,"abstract":"Interlaced Magnetic Recording (IMR) is an emerging recording technology for hard-disk drives (HDDs) that provides larger storage capacity at a lower cost. By partially overlapping (interlacing) each bottom track with two adjacent top tracks, IMR-based HDDs successfully increase the data density while incurring some hardware write constraints. To update each bottom track, the data on two adjacent top tracks must be read and rewritten to avoid losing their valid data, resulting in additional overhead for performing read-modify-write (RMW) operations. Therefore, researchers have proposed various data management schemes to mitigate such overhead in recent years, aiming at improving the write performance. However, these designs have not taken into account the data characteristics of the file system, which is a crucial layer of operating systems for storing/retrieving data into/from HDDs. Consequently, the write performance improvement is limited due to the unawareness of spatial locality and hotness of data. This paper proposes a file-system-aware data management scheme called FSIMR to improve system write performance. Noticing that data of the same directory may have higher spatial locality and are mostly updated at the same time, FSIMR logically partitions the IMR-based HDD into fixed-sized zones; data belonging to the same directory will be arranged to one zone to reduce the time of seeking to-be-updated data (seek time). Furthermore, cold data within a zone are arranged to bottom tracks and updated in an out-of-place manner to eliminate RMW operations. Our experimental results show that the proposed FSIMR could reduce the seek time by up to 14% without introducing additional RMW operations, compared to existing designs.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overflow-free Compute Memories for Edge AI Acceleration","authors":"Flavio Ponzina, Marco Rios, Alexandre Levisse, Giovanni Ansaloni, David Atienza","doi":"10.1145/3609387","DOIUrl":"https://doi.org/10.1145/3609387","url":null,"abstract":"Compute memories are memory arrays augmented with dedicated logic to support arithmetic. They support the efficient execution of data-centric computing patterns, such as those characterizing Artificial Intelligence (AI) algorithms. These architectures can provide computing capabilities as part of the memory array structures (In-Memory Computing, IMC) or at their immediate periphery (Near-Memory Computing, NMC). By bringing the processing elements inside (or very close to) storage, compute memories minimize the cost of data access. Moreover, highly parallel (and, hence, high-performance) computations are enabled by exploiting the regular structure of memory arrays. However, the regular layout of memory elements also constrains the data range of inputs and outputs, since the bitwidths of operands and results stored at each address cannot be freely varied. Addressing this challenge, we herein propose a HW/SW co-design methodology combining careful per-layer quantization and inter-layer scaling with lightweight hardware support for overflow-free computation of dot-vector operations. We demonstrate their use to implement the convolutional and fully connected layers of AI models. We embody our strategy in two implementations, based on IMC and NMC, respectively. Experimental results highlight that an area overhead of only 10.5% (for IMC) and 12.9% (for NMC) is required when interfacing with a 2KB subarray. Furthermore, inferences on benchmark CNNs show negligible accuracy degradation due to quantization for equivalent floating-point implementations.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks","authors":"Harsh Sharma, Lukas Pfromm, Rasit Onur Topaloglu, Janardhan Rao Doppa, Umit Y. Ogras, Ananth Kalyanraman, Partha Pratim Pande","doi":"10.1145/3608098","DOIUrl":"https://doi.org/10.1145/3608098","url":null,"abstract":"Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications including machine learning. Network-on-Interposer (NoI) enables integration of multiple chiplets on a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures have a limited computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data flow pattern, and optimizes the inter-chiplet data exchange to extract high performance for multiple types of convolutional neural network (CNN) inference tasks running concurrently. We demonstrate that the Floret architecture reduces the latency and energy up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple CNN tasks simultaneously. Floret achieves high performance and significant energy savings with much lower fabrication cost by exploiting the data-flow awareness of the CNN inference tasks.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CABARRE: Request Response Arbitration for Shared Cache Management","authors":"Garima Modi, Aritra Bagchi, Neetu Jindal, Ayan Mandal, Preeti Ranjan Panda","doi":"10.1145/3608096","DOIUrl":"https://doi.org/10.1145/3608096","url":null,"abstract":"Modern multi-processor systems-on-chip (MPSoCs) are characterized by caches shared by multiple cores. These shared caches receive requests issued by the processor cores. Requests that are subject to cache misses may result in the generation of responses . These responses are received from the lower level of the memory hierarchy and written to the cache. The outstanding requests and responses contend for the shared cache bandwidth. To mitigate the impact of the cache bandwidth contention on the overall system performance, an efficient request and response arbitration policy is needed. Research on shared cache management has neglected the additional cache contention caused by responses, which are written to the cache. We propose CABARRE , a novel request and response arbitration policy at shared caches, so as to improve the overall system performance. CABARRE shows a performance improvement of 23% on average across a set of SPEC workloads compared to straightforward adaptations of state-of-the-art solutions.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"<scp>Hephaestus</scp> : Codesigning and Automating 3D Image Registration on Reconfigurable Architectures","authors":"Giuseppe Sorrentino, Marco Venere, Davide Conficconi, Eleonora D’Arnese, Marco Domenico Santambrogio","doi":"10.1145/3607928","DOIUrl":"https://doi.org/10.1145/3607928","url":null,"abstract":"Healthcare is a pivotal research field, and medical imaging is crucial in many applications. Therefore finding new architectural and algorithmic solutions would benefit highly repetitive image processing procedures. One of the most complex tasks in this sense is image registration, which finds the optimal geometric alignment among 3D image stacks and is widely employed in healthcare and robotics. Given the high computational demand of such a procedure, hardware accelerators are promising real-time and energy-efficient solutions, but they are complex to design and integrate within software pipelines. Therefore, this work presents an automation framework called Hephaestus that generates efficient 3D image registration pipelines combined with reconfigurable accelerators. Moreover, to alleviate the burden from the software, we codesign software-programmable accelerators that can adapt at run-time to the image volume dimensions. Hephaestus features a cross-platform abstraction layer that enables transparently high-performance and embedded systems deployment. However, given the computational complexity of 3D image registration, the embedded devices become a relevant and complex setting being constrained in memory; thus, they require further attention and tailoring of the accelerators and registration application to reach satisfactory results. Therefore, with Hephaestus , we also propose an approximation mechanism that enables such devices to perform the 3D image registration and even achieve, in some cases, the accuracy of the high-performance ones. Overall, Hephaestus demonstrates 1.85× of maximum speedup, 2.35× of efficiency improvement with respect to the State of the Art, a maximum speedup of 2.51× and 2.76× efficiency improvements against our software, while attaining state-of-the-art accuracy on 3D registrations.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ViT4Mal: Lightweight Vision Transformer for Malware Detection on Edge Devices","authors":"Akshara Ravi, Vivek Chaturvedi, Muhammad Shafique","doi":"10.1145/3609112","DOIUrl":"https://doi.org/10.1145/3609112","url":null,"abstract":"There has been a tremendous growth of edge devices connected to the network in recent years. Although these devices make our life simpler and smarter, they need to perform computations under severe resource and energy constraints, while being vulnerable to malware attacks. Once compromised, these devices are further exploited as attack vectors targeting critical infrastructure. Most existing malware detection solutions are resource and compute-intensive and hence perform poorly in protecting edge devices. In this paper, we propose a novel approach ViT4Mal that utilizes a lightweight vision transformer (ViT) for malware detection on an edge device. ViT4Mal first converts executable byte-code into images to learn malware features and later uses a customized lightweight ViT to detect malware with high accuracy. We have performed extensive experiments to compare our model with state-of-the-art CNNs in the malware detection domain. Experimental results corroborate that ViTs don’t demand deeper networks to achieve comparable accuracy of around 97% corresponding to heavily structured CNN models. We have also performed hardware deployment of our proposed lightweight ViT4Mal model on the Xilinx PYNQ Z1 FPGA board by applying specialized hardware optimizations such as quantization, loop pipelining, and array partitioning. ViT4Mal achieved an accuracy of ~94% and a 41x speedup compared to the original ViT model.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proactive Stripe Reconstruction to Improve Cache Use Efficiency of SSD-Based RAID Systems","authors":"Zhibing Sha, Jiaojiao Wu, Jun Li, Balazs Gerofi, Zhigang Cai, Jianwei Liao","doi":"10.1145/3609099","DOIUrl":"https://doi.org/10.1145/3609099","url":null,"abstract":"Solid-State Drives (SSDs) exhibit different failure characteristics compared to conventional hard disk drives. In particular, the Bit Error Rate (BER) of an SSD increases as it bears more writes. Then, Parity-based Redundant Array of Inexpensive Disks (RAID) arrays composed from SSDs are introduced to address correlated failures. In the RAID-5 implementation, specifically, the process of parity generation (or update) associating with a data stripe, consists of read and write operations to the SSDs. Whenever a new update request comes to the RAID system, the related parity must be also updated and flushed onto the RAID component of SSD. Such frequent parity updates result in poor RAID performance and shorten the life-time of the SSDs. Consequently, a DRAM cache is commonly equipped accompanying with the RAID controller, called the parity cache, and used to buffer the parity chunks that are most frequently updated data, for boosting I/O performance. To better improve the use efficiency of the parity cache, this paper proposes a stripe reconstruction approach to minimize the number of parity updates on SSDs, thus boosting I/O performance of the SSD RAID system. When the currently updated stripe has both cold and hot updated data chunks, it will proactively carry out stripe reconstruction if we can find another matched stripe that also includes cold and hot update data chunks on the complementary RAID components. In the reconstruction process, we first group the cold data chunks of two matched stripes, to build a new stripe and flush the parity chunk on the RAID component. After that, the hot data chunks are organized as a new stripe as well, and its parity chunk is buffered in the parity cache. This results in better cache use efficiency, as it can reduce the number of parity updates on RAID components of SSDs, as well as proactively free up cache space for quickly absorbing subsequent write requests. In addition, the proposed method adjusts the target SSD of write requests based on stripe reconstructions through considering the I/O workload balance of all SSDs. Experimental results show that our proposal can reduce the number of parity chunk updates in SSDs by 2.3% and overall I/O latency by 12.2% on average, compared to state-of-the-art parity cache management techniques.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Methods to Realize Preemption in Phased Execution Models","authors":"Thilanka Thilakasiri, Matthias Becker","doi":"10.1145/3609132","DOIUrl":"https://doi.org/10.1145/3609132","url":null,"abstract":"Phased execution models are a good solution to tame the increased complexity and contention of commercial off-the-shelf (COTS) multi-core platforms, e.g., Acquisition-Execution-Restitution (AER) model, PRedictable Execution Model (PREM). Such models separate execution from access to shared resources on the platform to minimize contention. All data and instructions needed during an execution phase are copied into the local memory of the core before starting to execute. Phased execution models are generally used with non-preemptive scheduling to increase predictability. However, the blocking time in non-preemptive systems can reduce schedulability. Therefore, an investigation of preemption methods for phased execution models is warranted. Although, preemption for phased execution models must be carefully designed to retain its execution semantics, i.e., the handling of local memory during preemption becomes non-trivial. This paper investigates different methods to realize preemption in phased execution models while preserving their semantics. To the best of our knowledge, this is the first paper to explore different approaches to implement preemption in phased execution models from the perspective of data management. We introduce two strategies to realize preemption of execution phases based on different methods of handling local data of the preempted task. Heuristics are used to create time-triggered schedules for task sets that follow the proposed preemption methods. Additionally, a schedulability-aware preemption heuristic is proposed to reduce the number of preemptions by allowing preemption only when it is beneficial in terms of schedulability. Evaluations on a large number of synthetic task sets are performed to compare the proposed preemption models against each other and against a non-preemptive version. Furthermore, our schedulability-aware preemption heuristic has higher schedulability with a clear margin in all our experiments compared to the non-preemptive and fully-preemptive versions.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"<scp>ObNoCs</scp> : Protecting Network-on-Chip Fabrics Against Reverse-Engineering Attacks","authors":"Dipal Halder, Maneesh Merugu, Sandip Ray","doi":"10.1145/3609107","DOIUrl":"https://doi.org/10.1145/3609107","url":null,"abstract":"Modern System-on-Chip designs typically use Network-on-Chip (NoC) fabrics to implement coordination among integrated hardware blocks. An important class of security vulnerabilities involves a rogue foundry reverse-engineering the NoC topology and routing logic. In this paper, we develop an infrastructure, ObNoCs , for protecting NoC fabrics against such attacks. ObNoCs systematically replaces router connections with switches that can be programmed after fabrication to induce the desired topology. Our approach provides provable redaction of NoC functionality: switch configurations induce a large number of legal topologies, only one of which corresponds to the intended topology. We implement the ObNoCs methodology on Intel Quartus™ Platform, and experimental results on realistic SoC designs show that the architecture incurs minimal overhead in power, resource utilization, and system latency.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136192828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks","authors":"Yunjie Pan, Jiecao Yu, Andrew Lukefahr, Reetuparna Das, Scott Mahlke","doi":"10.1145/3609093","DOIUrl":"https://doi.org/10.1145/3609093","url":null,"abstract":"Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a wide range of machine learning tasks. However, the high accuracy usually comes at the cost of substantial computation and energy consumption, making it difficult to be deployed on mobile and embedded devices. In CNNs, the compute-intensive convolutional layers are usually followed by a ReLU activation layer, which clamps negative outputs to zeros, resulting in large activation sparsity. By exploiting such sparsity in CNN models, we propose a software-hardware co-design BitSET, that aggressively saves energy during CNN inference. The bit-serial BitSET accelerator adopts a prediction-based bit-level early termination technique that terminates the ineffectual computation of negative outputs early. To assist the algorithm, we propose a novel weight encoding that allows more accurate predictions with fewer bits. BitSET leverages the bit-level computation reduction both in the predictive early termination algorithm and in the non-predictive, energy-efficient bit-serial architecture. Compared to UNPU, an energy-efficient bit-serial CNN accelerator, BitSET yields an average 1.5× speedup and 1.4× energy efficiency improvement with no accuracy loss due to a 48% reduction in bit-level computations. Relaxing the allowed accuracy loss to 1% increases the gains to an average of 1.6× speedup and 1.4× energy efficiency improvement.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136108458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}