SPIMulator: A Spintronic Processing-In-Memory Simulator for Racetracks
Pavia Bera, Stephen Cahoon, Sanjukta Bhanja, Alex Jones
ACM Transactions on Embedded Computing Systems, published 2024-02-08. DOI: 10.1145/3645112

In-memory processing is becoming a popular method to alleviate the memory bottleneck of the von Neumann computing model. With the goal of improving both the latency and the energy cost associated with such in-memory processing, emerging non-volatile memory technologies, such as spintronic magnetic memory, are of particular interest: they provide near-SRAM read/write performance and eliminate nearly all static energy without experiencing any endurance limitations. Spintronic Racetrack Memory (RM) further addresses the density concerns of spin-transfer torque memory (STT-MRAM). Moreover, it has recently been demonstrated that portions of RM nanowires can function as a polymorphic gate, which can be leveraged to implement multi-operand bulk bitwise operations. With more complex control, they can also be leveraged to build integer and floating-point arithmetic processing-in-memory (PIM) primitives. This paper proposes SPIMulator, a Spintronic PIM simulator that models both the storage and the PIM architecture of executing PIM commands in Racetrack memory. SPIMulator functionally models the polymorphic gate properties recently proposed for Racetrack memory, which allow a transverse access that determines the number of '1's in a segment of each Racetrack nanowire. From this simulation, SPIMulator reports real-time performance statistics such as cycle count and energy. Thus, SPIMulator simulates the recently proposed multi-operand bitwise logic operations and can easily be extended to implement new PIM operations as they are developed. Because SPIMulator is functional, it can also serve as a programming environment for developing PIM-based codes and verifying new acceleration algorithms. We demonstrate the value of SPIMulator through modeling and estimating the performance and energy consumption of a variety of example applications, including the Advanced Encryption Standard (AES) for encryption, based primarily on logical and look-up operations; matrix multiplication, a frequent requirement in scientific, signal-processing, and machine-learning algorithms; and bitmap indices, a common search table employed for database lookups.

{"title":"STDF: Spatio-Temporal Deformable Fusion for Video Quality Enhancement on Embedded Platforms","authors":"Jianing Deng, Shunjie Dong, Lvcheng Chen, Jingtong Hu, Cheng Zhuo","doi":"10.1145/3645113","DOIUrl":"https://doi.org/10.1145/3645113","url":null,"abstract":"<p>With the development of embedded systems and deep learning, it is feasible to combine them for offering various and convenient human-centered services, which is based on high-quality (HQ) videos. However, due to the limit of video traffic load and unavoidable noise, the visual quality of an image from an edge camera may degrade significantly, influencing the overall video and service quality. To maintain video stability, video quality enhancement (QE), aiming at recovering high-quality (HQ) videos from their distorted low-quality (LQ) sources, has aroused increasing attention in recent years. The key challenge for video quality enhancement lies in how to effectively aggregate complementary information from multiple frames (i.e., temporal fusion). To handle diverse motion in videos, existing methods commonly apply motion compensation before the temporal fusion. However, the motion field estimated from the distorted LQ video tends to be inaccurate and unreliable, thereby resulting in ineffective fusion and restoration. In addition, motion estimation for consecutive frames is generally conducted in a pairwise manner, which leads to expensive and inefficient computation. In this paper, we propose a fast yet effective temporal fusion scheme for video QE by incorporating a novel Spatio-Temporal Deformable Convolution (STDC) to simultaneously compensate motion and aggregate temporal information. Specifically, the proposed temporal fusion scheme takes a target frame along with its adjacent reference frames as input to jointly estimate an offset field to deform the spatio-temporal sampling positions of convolution. As a result, complementary information from multiple frames can be fused within the STDC operation in one forward pass. Extensive experimental results on three benchmark datasets show that our method performs favorably to the state-of-the-arts in terms of accuracy and efficiency.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"11 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139773088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Space-Grained Cleaning Method to Reduce Long-Tail Latency of DM-SMR Disks","authors":"Chin-Hsien Wu, Cheng-Tze Lee, Yi-Ren Tsai, Cheng-Yen Wu","doi":"10.1145/3643827","DOIUrl":"https://doi.org/10.1145/3643827","url":null,"abstract":"<p>DM-SMR (device-managed shingled magnetic recording) disks allocate a portion of disk space as the persistent cache (PC) to address the issue of overlapping tracks during data updates. When the PC space becomes insufficient, a space cleaning is triggered to reclaim its invalid space. However, the space cleaning is time-consuming and contributes to the long-tail latency of DM-SMR disks. In the paper, we will propose a space-grained cleaning method that leverages various idle periods to effectively reduce the long-tail latency of DM-SMR disks. The objective is to perform a proper space-grained cleaning for a suitable space region at an appropriate time period, thereby preventing delays in subsequent I/O requests and reducing the long-tail latency associated with DM-SMR disks. The experimental results demonstrate a substantial reduction in the long-tail latency of DM-SMR disks through the proposed method.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"99 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139754113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compact Instruction Set Extensions for Dilithium
Lu Li, Qi Tian, Guofeng Qin, Shuaiyu Chen, Weijia Wang
ACM Transactions on Embedded Computing Systems, published 2024-02-02. DOI: 10.1145/3643826

Post-quantum cryptography is intended to provide security against both traditional and quantum computer attacks. Dilithium is a digital signature algorithm that derives its security from the challenge of finding short vectors in lattices, and it has been selected for standardization in the NIST post-quantum cryptography project. Hardware-software co-design is a commonly adopted implementation strategy for addressing implementation challenges such as limited resources, high performance, and flexibility requirements. In this study, we investigate compact instruction set extensions (ISEs) for Dilithium, aiming to improve software efficiency with low hardware overhead. To begin with, we propose tightly coupled accelerators that are deeply integrated into the RISC-V processor. These accelerators target the most computationally demanding components on resource-constrained processors, such as polynomial generation, the Number Theoretic Transform (NTT), and modular arithmetic. Next, we design a set of custom instructions that integrate seamlessly with the RISC-V base instruction formats, completing the accelerators in a compact manner. Subsequently, we implement our ISEs in a chip design for the Hummingbird E203 core and benchmark Dilithium using these ISEs. Additionally, we evaluate the resource consumption of the ISEs on FPGA and ASIC technologies. Compared to the reference software implementation on the RISC-V core, our co-design demonstrates a remarkable speedup factor ranging from 6.95 to 9.96. This improvement is achieved with modest additional hardware resources: a 35% increase in LUTs, a 14% increase in FFs, 7 additional DSPs, and no additional RAM. Furthermore, compared to the state-of-the-art approach, our work achieves higher speed with a reduced circuit cost; specifically, the usage of additional LUTs, FFs, and RAMs is reduced by 47.53%, 50.43%, and 100%, respectively. On ASIC technology, our approach demonstrates a cell count of 12,412. Overall, our co-design provides a better trade-off between speed performance and circuit overhead.

Flexible Updating of Internet of Things Computing Functions through Optimizing Dynamic Partial Reconfiguration
George Kornaros, Svoronos Leivadaros, Filippos Kolimbianakis
ACM Transactions on Embedded Computing Systems, published 2024-02-01. DOI: 10.1145/3643825

As applications become increasingly compute- and data-intensive and require more processing power, many internet-of-things (IoT) platforms in robots, drones, and autonomous vehicles that implement neural network inference, cryptographic functions, or signal processing (e.g., multimedia, communication) employ field programmable gate arrays (FPGAs). At the same time, dynamic partial reconfiguration (DPR) in modern FPGAs makes it possible to change the function of one part of the FPGA by dynamically loading new bitstreams into its logic regions without affecting the function of other parts of the device. This is especially useful for updating the functions of IoT devices while they are in operation, whether for bug fixing or functionality adjustments, and it matters most when these devices integrate low-cost FPGAs that can accommodate only a few hardware accelerators. To address one of the major limitations of using partial reconfiguration in IoT devices, this work introduces techniques to use DPR flexibly, namely FLEXDPR, by sharing reconfigurable partitions among different accelerator functions and by supporting virtual relocation of these functions. Experimental results on the Xilinx ZYNQ-7000 platform reveal energy and latency efficiency improvements of about 20% on average. Overall, the suggested approach can reduce partial reconfiguration overhead while easing the scheduler's decisions about deploying hardware functions across time and space in a performance-conscious manner.

{"title":"Customized FPGA Implementation of Authenticated Lightweight Cipher Fountain for IoT Systems","authors":"Zhengyuan Shi, Cheng Chen, Gangqiang Yang, Hongchao Zhou, Hailiang Xiong, Zhiguo Wan","doi":"10.1145/3643039","DOIUrl":"https://doi.org/10.1145/3643039","url":null,"abstract":"<p>Authenticated Encryption with Associated-Data (AEAD) can ensure both confidentiality and integrity of information in encrypted communication. Distinctive variants are customized from AEAD to satisfy various requirements. In this paper, we take a 128-bit lightweight AEAD stream cipher Fountain as an example. We provide a general cryptographic solution with three Fountain variants. These three variants are for encryption, message authentication code (MAC) generation, and authenticated encryption with associated data, respectively. Besides, we propose area-saved and throughput-improved strategies for the FPGA implementation of Fountain. The conventional paralleled hardware implementation leads to much resource-consuming with higher parallel width. We propose a hybrid architecture with parallel and serial update modes simultaneously. We also analyze the trade-off between area occupation and authentication latency for those two architectures. According to our discussion, hybrid architectures can perform efficiently with higher throughput than most ciphers, including Grain-128 x32. Our Fountain keystream generator occupies 46 slices on Spartan-3 FPGAs, smaller than most ciphers with the same security level, and even smaller than the 80-bit security level cipher Trivium. In summary, the customized Fountain with optimized implementations on FPGA is suitable for various applications in the field of IoT.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"151 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139589025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intelligent Caching for Vehicular Dew Computing in Poor Network Connectivity Environments","authors":"Liang Zhao, Hongxuan Li, Enchao Zhang, Ammar Hawbani, Mingwei Lin, Shaohua Wan, Mohsen Guizani","doi":"10.1145/3643038","DOIUrl":"https://doi.org/10.1145/3643038","url":null,"abstract":"<p>In vehicular networks, some edge servers may not function properly due to the time-varying load condition and the uneven computing resource distribution, resulting in a low quality of caching services. To overcome this challenge, we develop a Vehicular dew computing (VDC) architecture for the first time by combining dew computing with vehicular networks, which can achieve wireless communication between vehicles in a resource-constrained environment. Consequently, it is crucial to develop an adaptive caching scheme that empowers vehicles to form efficient cooperation in VDC. In this paper, we propose an intelligent caching scheme based on VDC architecture, which includes two parts. First, to meet the dynamic nature of VDC, a spatiotemporal vehicle clustering algorithm is proposed to establish adaptive cooperation to assist content caching for vehicles. Second, the multi-armed bandit algorithm is employed to select suitable content for caching in vehicles based on real-time file popularity, and a model is established to dynamically update each vehicle’s request preferences. Extensive experiments are conducted to demonstrate that the proposed scheme has excellent performance in terms of cluster head stability and cache hit rate.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"76 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139553715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PolyARBerNN: A Neural Network Guided Solver and Optimizer for Bounded Polynomial Inequalities","authors":"Wael Fatnassi, Yasser Shoukry","doi":"10.1145/3632970","DOIUrl":"https://doi.org/10.1145/3632970","url":null,"abstract":"<p>Constraints solvers play a significant role in the analysis, synthesis, and formal verification of complex cyber-physical systems. In this paper, we study the problem of designing a scalable constraints solver for an important class of constraints named polynomial constraint inequalities (also known as nonlinear real arithmetic theory). In this paper, we introduce a solver named PolyARBerNN that uses convex polynomials as abstractions for highly nonlinears polynomials. Such abstractions were previously shown to be powerful to prune the search space and restrict the usage of sound and complete solvers to small search spaces. Compared with the previous efforts on using convex abstractions, PolyARBerNN provides three main contributions namely (i) a neural network guided abstraction refinement procedure that helps selecting the right abstraction out of a set of pre-defined abstractions, (ii) a Bernstein polynomial-based search space pruning mechanism that can be used to compute tight estimates of the polynomial maximum and minimum values which can be used as an additional abstraction of the polynomials, and (iii) an optimizer that transforms polynomial objective functions into polynomial constraints (on the gradient of the objective function) whose solutions are guaranteed to be close to the global optima. These enhancements together allowed the PolyARBerNN solver to solve complex instances and scales more favorably compared to the state-of-art nonlinear real arithmetic solvers while maintaining the soundness and completeness of the resulting solver. In particular, our test benches show that PolyARBerNN achieved 100X speedup compared with Z3 8.9, Yices 2.6, and PVS (a solver that uses Bernstein expansion to solve multivariate polynomial constraints) on a variety of standard test benches. Finally, we implemented an optimizer called PolyAROpt that uses PolyARBerNN to solve constrained polynomial optimization problems. Numerical results show that PolyAROpt is able to solve high-dimensional and high order polynomial optimization problems with higher speed compared to the built-in optimizer in the Z3 8.9 solver.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"7 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139553761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adversarial Transferability in Embedded Sensor Systems: An Activity Recognition Perspective","authors":"Ramesh Kumar Sah, Hassan Ghasemzadeh","doi":"10.1145/3641861","DOIUrl":"https://doi.org/10.1145/3641861","url":null,"abstract":"<p>Machine learning algorithms are increasingly used for inference and decision-making in embedded systems. Data from sensors are used to train machine learning models for various smart functions of embedded and cyber-physical systems ranging from applications in healthcare, autonomous vehicles, and national security. However, recent studies have shown that machine learning models can be fooled by adding adversarial noise to their inputs. The perturbed inputs are called adversarial examples. Furthermore, adversarial examples designed to fool one machine learning system are also often effective against another system. This property of adversarial examples is called <i>adversarial transferability</i> and has not been explored in wearable systems to date. In this work, we take the first stride in studying adversarial transferability in wearable sensor systems from four viewpoints: (1) transferability between machine learning models; (2) transferability across users/subjects of the embedded system; (3) transferability across sensor body locations; and (4) transferability across datasets used for model training. We present a set of carefully designed experiments to investigate these transferability scenarios. We also propose a threat model describing the interactions of an adversary with the source and target sensor systems in different transferability settings. In most cases, we found high untargeted transferability, whereas targeted transferability success scores varied from (0% ) to (80% ). The transferability of adversarial examples depends on many factors such as the inclusion of data from all subjects, sensor body position, number of samples in the dataset, type of learning algorithm, and the distribution of source and target system dataset. The transferability of adversarial examples decreased sharply when the data distribution of the source and target system became more distinct. We also provide guidelines and suggestions for the community for designing robust sensor systems. Code and dataset used in our analysis is publicly available here.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"53 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139517265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stash: Flexible Energy Storage for Intermittent Sensors
Arwa Alsubhi, Simeon Babatunde, Nicole Tobias, Jacob Sorber
ACM Transactions on Embedded Computing Systems, published 2024-01-19. DOI: 10.1145/3641511

Batteryless sensors promise a sustainable future for sensing, but they face significant challenges when storing and using environmental energy. Incoming energy can fluctuate unpredictably between periods of scarcity and abundance, and device performance depends both on the incoming energy and on how much a device can store. Existing batteryless devices have used fixed or run-time-selectable front-end capacitor banks to meet the energy needs of different tasks. Neither approach adapts well to rapidly changing energy-harvesting conditions, nor does either allow devices to store excess energy during times of abundance without sacrificing performance.

This paper presents Stash, a hardware back-end energy storage technique that allows batteryless devices to charge quickly and store excess energy when it is abundant, extending their operating time and carrying out additional tasks without compromising the main ones. Stash performs like a small-capacitor device when small capacitors excel and like a large-capacitor device when large capacitors excel, with no additional software complexity and negligible power overhead. We evaluate Stash using two applications, temperature sensing and wearable activity monitoring, under both synthetic solar energy and recorded solar and thermal traces from various human activities. Our results show that Stash increased sensor coverage by up to 15% under variable energy-harvesting conditions compared to competitor configurations that used fixed small, fixed large, and reconfigurable front-end energy storage.
