{"title":"Set associative address mapping to improve data throughput and reduce tail latency in SSDs","authors":"Aobo Yang, Jiaojiao Wu, Jiaxu Wu, Fan Yang, Zhibing Sha, Shiyu Zhong, Zhigang Cai, Jianwei Liao","doi":"10.1016/j.sysarc.2025.103445","DOIUrl":"10.1016/j.sysarc.2025.103445","url":null,"abstract":"<div><div>Solid State Drives (SSDs) have become the mainstream storage infrastructure across diverse computing systems. To access the data on the flash memory, a software component called Flash Translation Layer (FTL) is used to convert the logical address of an I/O request into the corresponding physical address, at the granularity of a page. This process is referred to as page-level address mapping in SSDs. An effective mapping method should fully utilize internal parallelism to maximize I/O throughput of the SSD device, while also paying attention to the long tail latency for guaranteeing user experience. Existing mapping approaches, however, have yet to effectively address both aspects simultaneously. Therefore, this paper proposes a <strong><u>s</u></strong>et <strong><u>a</u></strong>ssociative <strong><u>map</u></strong>ping approach, called <em><strong>SAMap</strong></em> to direct data allocation on the basis of static mapping, to improve data throughput and reducing long tail latency. Specifically, <em>SAMap</em> manages a number of channels into the granularity of <strong>set</strong>, and enables set associative mapping for data allocation. In the case of a write request being mapped to a specific channel by following the policy of static mapping, <em>SAMap</em> can forward it to any channel in the same set, by considering I/O workload balance across channels. Trace-driven experiments show that our proposal can enhance I/O data throughput by <span>36.0</span>% on average and cut down the tail latency by between <span>28.1</span>% and <span>57.0</span>%, at the <em>99.99th</em> percentile, in contrast to existing approaches.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"166 ","pages":"Article 103445"},"PeriodicalIF":3.7,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144138043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hamun: An approximate computing method to prolong the lifespan of ReRAM-based accelerators","authors":"Mohammad Sabri, Marc Riera, Antonio González","doi":"10.1016/j.sysarc.2025.103444","DOIUrl":"10.1016/j.sysarc.2025.103444","url":null,"abstract":"<div><div>ReRAM-based accelerators exhibit enormous potential to increase computational efficiency for DNN inference tasks, delivering significant performance and energy savings over traditional platforms. By incorporating adaptive scheduling, these accelerators dynamically adjust to DNN requirements, optimizing allocation of constrained hardware resources. However, ReRAM cells have limited endurance cycles due to wear-out from multiple updates for each inference execution, which shortens the lifespan of ReRAM-based accelerators and presents a practical challenge in positioning them as alternatives to conventional platforms like TPUs. Addressing these endurance limitations is essential for making ReRAM-based solutions viable for long-term, high-performance DNN inference.</div><div>To address the lifespan limitations of ReRAM-based accelerators, we introduce <em>Hamun</em>, an approximate computing method designed to extend the lifespan of ReRAM-based accelerators through a range of optimizations. Hamun incorporates a novel mechanism that detects faulty cells due to wear-out and retires them, avoiding in this way their otherwise adverse impact on DNN accuracy. Moreover, Hamun extends the lifespan of ReRAM-based accelerators by adapting wear-leveling techniques across various abstraction levels of the accelerator and implementing a batch execution scheme to maximize ReRAM cell usage for multiple inferences. Additionally, Hamun introduces a new approximation method that leverages the fault tolerance characteristics of DNNs to delay the retirement of worn-out cells, reducing the performance penalty of retired cells and further extending the accelerator’s lifespan. On average, evaluated on a set of popular DNNs, Hamun demonstrates an improvement in lifespan of <span><math><mrow><mn>13</mn><mo>.</mo><mn>2</mn><mo>×</mo></mrow></math></span> over a state-of-the-art baseline. The main contributors to this improvement are the fault handling and batch execution schemes, which provide <span><math><mrow><mn>4</mn><mo>.</mo><mn>6</mn><mo>×</mo></mrow></math></span> and <span><math><mrow><mn>2</mn><mo>.</mo><mn>6</mn><mo>×</mo></mrow></math></span> lifespan improvements respectively.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"166 ","pages":"Article 103444"},"PeriodicalIF":3.7,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144138042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DNNPipe: Dynamic programming-based optimal DNN partitioning for pipelined inference on IoT networks","authors":"Woobean Seo , Saehwa Kim , Seongsoo Hong","doi":"10.1016/j.sysarc.2025.103462","DOIUrl":"10.1016/j.sysarc.2025.103462","url":null,"abstract":"<div><div>Pipeline parallelization is an effective technique that enables the efficient execution of deep neural network (DNN) inference on resource-constrained IoT devices. To enable pipeline parallelization across computing nodes with asymmetric performance profiles, interconnected via low-latency, high-bandwidth networks, we propose DNNPipe, a DNN partitioning algorithm that constructs a pipeline plan for a given DNN. The primary objective of DNNPipe is to maximize the throughput of DNN inference while minimizing the runtime overhead of DNN partitioning, which is repeatedly executed online in dynamically changing IoT environments. To achieve this, DNNPipe uses dynamic programming (DP) with pruning techniques that preserve optimality to explore the search space and find the optimal pipeline plan whose maximum stage time is no greater than that of any other possible pipeline plan. Specifically, it aggressively prunes suboptimal pipeline plans using two pruning techniques: <em>upper-bound-based pruning</em> and <em>under-utilized-stage pruning</em>. Our experimental results demonstrate that pipelined inference using an obtained optimal pipeline plan improves DNN throughput by up to 1.78 times compared to the highest performing single device and DNNPipe achieves up to 98.26 % lower runtime overhead compared to PipeEdge, the fastest known optimal DNN partitioning algorithm.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"166 ","pages":"Article 103462"},"PeriodicalIF":3.7,"publicationDate":"2025-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144170018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Contention-aware workflow scheduling on heterogeneous computing systems with shared buses","authors":"Yiming Zheng, Quanwang Wu, Kun Cai, Yunni Xia","doi":"10.1016/j.sysarc.2025.103434","DOIUrl":"10.1016/j.sysarc.2025.103434","url":null,"abstract":"<div><div>Heterogeneous computing systems (HCSs), which balance performance and efficiency by leveraging diverse computation resources, are widely used for executing workflow applications. These computation resources are typically interconnected through shared buses. When multiple participants simultaneously transmit data, communication contention arises, leading to longer communication time than expected. Nevertheless, the contention issue for the shared bus is rarely investigated in the literature. This paper proposes a Contention-Aware Clustering-based List scheduling (CACL) method to effectively address workflow scheduling in shared bus-based HCSs. In CACL, tasks are grouped into clusters based on criticality, clusters are mapped to computation resources, and then all tasks and edges in the workflow are scheduled to computation and communication resources in the HCS. Experimental results on realistic workflows demonstrate that CACL effectively addresses the communication contention issue, and reduces scheduling length by 5%–25% compared to existing methods, making it a robust solution for workflow scheduling in shared-bus-based HCSs.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"165 ","pages":"Article 103434"},"PeriodicalIF":3.7,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144083901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MADDPG-based task offloading and resource pricing in edge collaboration environment","authors":"Zhao Tong , Xin Deng , Yuanyang Zhang , Jing Mei , Can Wang , Keqin Li","doi":"10.1016/j.sysarc.2025.103433","DOIUrl":"10.1016/j.sysarc.2025.103433","url":null,"abstract":"<div><div>With the rapid advancement of fifth-generation communication technologies, the data produced by the Internet of Everything is growing exponentially. As mobile cloud computing struggles to keep up with the demands for massive data processing and low latency, mobile edge computing (MEC) has emerged as a solution. By shifting services from centralized cloud platforms to edge servers located closer to data sources, MEC achieves reduced latency, enhanced computing efficiency, and an improved user experience. This paper introduces a task offloading algorithm designed for a multi-base station cooperative mobile edge environment, addressing the challenges of task offloading and resource pricing. The system architecture includes a macro base station and several micro base stations, strategically deployed in a densely populated mobile device area. Each mobile device serves as an autonomous decision-making unit, offloading tasks to an optimal base station. We model the interactions between base stations and end-users using a Stackelberg game approach, with strategy optimization achieved through a multi-agent deep deterministic policy gradient algorithm. The proposed TO-SG-MADDPG algorithm intelligently coordinates the policies of multiple base stations and end-users by centralized training and distributed execution, resulting in globally optimal task offloading and resource pricing. The results demonstrate that the proposed algorithm not only reduces the task loss rate but also safeguards the interests of all stakeholders.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"165 ","pages":"Article 103433"},"PeriodicalIF":3.7,"publicationDate":"2025-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144089832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-driven barrier certificate generation using deep learning and symbolic regression","authors":"Xiaoxuan Ma, Xiongqi Zhang, Ning Lv, Xiuqing Cao, Wang Lin, Zuohua Ding","doi":"10.1016/j.sysarc.2025.103419","DOIUrl":"10.1016/j.sysarc.2025.103419","url":null,"abstract":"<div><div>Barrier certificate generation is an efficient and powerful technique for formally verifying the safety properties of cyber–physical systems. Neural networks are commonly used as the templates for barrier certificates, but the complex network structure makes it a challenge to verify the correctness of neural certificates. In this paper, we propose a novel data-driven framework that leverages deep learning and symbolic regression to synthesize barrier certificates in analytical form, with high efficiency and scalability. The framework is structured as an inductive loop with neural network training, distillation and verification. Specifically, a <em>Learner</em> leverages deep learning to train neural barrier candidates, which are then used as input for a <em>Distiller</em> to generate analytical barrier candidates via symbolic regression. Due to the simple analytical expressions, a <em>Verifier</em> then efficiently ensures the formal soundness of the analytical barrier candidates via an satisfiability modulo theories (SMT) solver, or generates counterexamples to further guide the <em>Learner</em>. We implement the tool <em>SR4BC</em>, and evaluate its performance over a set of benchmarks, which validates that <em>SR4BC</em> is much more efficient and effective than the state-of-the-art approaches.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"165 ","pages":"Article 103419"},"PeriodicalIF":3.7,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144068266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAROS: Genetic algorithm-aided row-skipping for shift and duplicate kernel mapping in processing-in-memory architectures","authors":"Johnny Rhe , Kang Eun Jeon , Jong Hwan Ko","doi":"10.1016/j.sysarc.2025.103423","DOIUrl":"10.1016/j.sysarc.2025.103423","url":null,"abstract":"<div><div>Processing-in-memory (PIM) architecture is becoming a promising candidate for convolutional neural network (CNN) inference. A recent mapping method, shift and duplicate kernel (SDK), enhances latency by improving array utilization through shifting the same kernels into idle columns. Although pattern-based pruning effectively enables row-skipping, traditional pattern designs are suboptimal for SDK mapping due to the irregular kernel shifts, complicating row-skipping. To address this, we proposed pruning-aided row-skipping (PAIRS), which adopts SDK-optimized layer-wise patterns. However, PAIRS has two key limitations: it offers discrete row-skipping by using single pattern set, restricting precise control over the weight matrix compression for varying layer and array sizes, and it risks accuracy loss by pruning critical weights. To overcome these challenges, we introduce genetic algorithm-aided row-skipping (GAROS), which employs input channel (IC)-wise patterns. GAROS enables finer control over row-skipping by assigning several pattern sets and selecting optimal patterns to each IC for preserving critical weights. Consequently, this approach enables continuous weight matrix compression while balancing the trade-off between row-skipping and accuracy. Simulation results in WRN16-4 demonstrate that GAROS improved accuracy by up to +2.4% compared to PAIRS and achieved up to a 1.74<span><math><mo>×</mo></math></span> speedup compared to baseline when 128 × 128 sub-array is used.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"165 ","pages":"Article 103423"},"PeriodicalIF":3.7,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143947690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NANI: Energy-efficient Neuron-Aware hardware Noise Injection for adversarial defense using undervolting","authors":"Lening Wang , Qiyu Wan , Jing Wang , Mingsong Chen , Lu Peng , Xin Fu","doi":"10.1016/j.sysarc.2025.103424","DOIUrl":"10.1016/j.sysarc.2025.103424","url":null,"abstract":"<div><div>Convolutional Neural Networks (CNNs) are susceptible to adversarial attacks. A recent defense approach involves adding random noise to adversarial images, which can help CNNs mitigate adversarial impact. However, existing noise-injection defenses often reduce accuracy on benign images. Noticing that different neurons tolerate varying noise levels, we propose a neuron-aware noise injection scheme that accounts for neurons’ significance. This approach aims to defend against adversarial attacks while preserving benign accuracy. On the other side, undervolting is one of the techniques to generate noises , and meanwhile achieve energy savings. In this work, we have noticed that different processing elements (PEs) exhibit varying hardware error rates even when subjected to the same undervolting voltage level. By appropriately mapping specific neurons to specific PEs, we not only facilitate the implementation of our neuron-aware noise injection scheme on hardware, but we can also aggressively improve the energy efficiency. Finally, we present our vulnerable PE-enabled Neuron-Aware undervolting Noise Injection (NANI) scheme, which aims to defend against adversarial attacks by identifying and leveraging these vulnerable PEs to produce proper noise to proper neurons. Implementing NANI on FPGA, we achieve a 74% correction rate on adversarial examples and 33% energy savings with negligible accuracy drop on benign images.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"165 ","pages":"Article 103424"},"PeriodicalIF":3.7,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data clustering on hybrid classical-quantum NISQ architecture with generative-based variational and parallel algorithms","authors":"Julien Rauch , Damien Rontani , Stéphane Vialle","doi":"10.1016/j.sysarc.2025.103431","DOIUrl":"10.1016/j.sysarc.2025.103431","url":null,"abstract":"<div><div>Clustering is a well-established unsupervised machine-learning approach to classify data automatically. In large datasets, the classical version of such algorithms performs well only if significant computing resources are available (e.g., GPU). A distinct computational framework compared to classical methods relies on integrating a <em>quantum processing unit</em> (QPU) to alleviate the computing cost. This is achieved through the QPU’s ability to exploit quantum effects, such as superposition and entanglement, to natively parallelize computation or approximate multidimensional distributions for probabilistic computing (Born rule).</div><div>In this paper, we propose first a clustering algorithm adapted to a hybrid CPU–QPU architecture while considering the current limitations of <em>noisy intermediate-scale quantum</em> (NISQ) technology. Secondly, we propose a quantum algorithm that exploits the probabilistic nature of quantum physics to make the most of our QPU’s potential. Our approach leverage on ideas from generative machine-learning algorithm and <em>variational quantum algorithms</em> (VQA) to design an hybrid QPU–CPU algorithm based on a mixture of so-called <em>quantum circuits Born machines</em> (QCBM). We implemented and tested the quality of our algorithm on an IBM quantum machine, then parallelized it to make better use of quantum resources and speed up the execution of quantum-based clustering algorithms.</div><div>Finally, summarize the lessons learned from exploiting a CPU–QPU architecture on NISQ for data clustering.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"165 ","pages":"Article 103431"},"PeriodicalIF":3.7,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144071813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Breathing new life into compression: Resolving the dilemma of LFS with compression on flash storage","authors":"Yunpeng Song, Yiyang Huang, Dingcui Yu, Liang Shi","doi":"10.1016/j.sysarc.2025.103432","DOIUrl":"10.1016/j.sysarc.2025.103432","url":null,"abstract":"<div><div>State-of-the-art storage systems have widely adopted log-structured file systems (LFS) with unique append–write capability, making them ideal for supporting compression. Compression is a recognized way of reducing data-occupied space and extending the lifetime of flash. However, implementing file system-level compression faces a dilemma that hampers its adoption. Two significant issues are responsible for this. Firstly, the software stack overhead resulting from compression is costly. Due to its location on the critical path for reads and writes, compression will block the user’s I/O requests. Secondly, compressing as much space as possible to enjoy the benefits of compression in terms of space will inevitably introduce compression overhead. This paper proposes a novel no-critical path compression scheme that significantly eliminates compression’s current dilemma. The basic idea is to perform non-critical path compression, minimizing the performance impact and maximizing the benefits of compression in space by disengaging compression from the critical paths of reads and writes. To achieve this, a critical path detachment scheme is first proposed to detach the compression from the critical path based on the properties of the non-critical path compression. Furthermore, a contention-avoiding scheduling scheme is proposed to minimize the impact on CPU costs. Finally, a reserve space (RS)-oriented allocation scheme is proposed to exploit the benefits of compression in space to optimize the cleaning cost of LFS. Through careful design and evaluation on a real platform, we demonstrate that the proposed scheme, NCPC, achieves encouraging performance and lifetime optimizations compared to state-of-the-art solutions.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"165 ","pages":"Article 103432"},"PeriodicalIF":3.7,"publicationDate":"2025-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}