Title: Dynamic task offloading and resource allocation for energy-harvesting end–edge–cloud computing systems
Authors: Xiaozhu Song, Qianpiao Ma, Gan Zheng, Liying Li, Peijin Cong, Junlong Zhou
Journal of Systems Architecture, Volume 167, Article 103469. DOI: 10.1016/j.sysarc.2025.103469. Published 2025-06-07.
Abstract: In end–edge–cloud (EEC) computing, end devices (EDs) offload compute-intensive tasks to nearby edge servers or the cloud server to alleviate processing burdens and enable a flexible computing architecture. However, resource constraints and dynamic environments pose significant challenges for EEC task offloading and resource allocation, including real-time requirements, unreliable task execution, and limited battery energy, especially in energy harvesting (EH) systems, in which battery energy remains unstable due to its inherent fluctuations. Existing task offloading and resource allocation approaches often fail to address these challenges holistically, leading to degraded performance and potential task execution failures. In this paper, we propose a novel task offloading and resource allocation method for EH EEC computing, aiming to optimize long-term performance by minimizing delay and energy consumption while ensuring task execution reliability and battery energy stability. Specifically, we formulate task offloading and resource allocation as a cost optimization problem under constraints such as ED capacity, task reliability, and energy consumption. To solve this problem, we first leverage Lyapunov optimization to decouple the original time-dependent problem. We then derive optimal closed-form solutions for computation and transmission power resource allocation. Based on these solutions, we propose a multiple discrete particle swarm optimization algorithm to determine task offloading decisions. Extensive experiments demonstrate the superiority of our method in balancing delay, execution reliability, and energy stability under varying conditions.
{"title":"FPGA-based CNN accelerator using Convolutional Processing Element to reduce idle states","authors":"Mohammad Dehnavi , Aran Ghasemi , Bijan Alizadeh","doi":"10.1016/j.sysarc.2025.103468","DOIUrl":"10.1016/j.sysarc.2025.103468","url":null,"abstract":"<div><div>Object detection has been a significant challenge in machine vision systems from the past to the present. Various hardware-based accelerators have been utilized to enhance speed efficiency. The primary objective of most of these accelerators is to minimize idle states in DSP blocks. In this paper, a new architecture based on Convolutional Processing Elements (CPEs) is proposed, wherein weights are stored, circularly shifted in an internal CPE buffer and used to generate output feature maps. In this way, the idle states of DSPs are reduced by increasing data reuse in CPEs and decreasing external memory accesses. The number of CPEs used to accelerate a CNN depends on the required speed and available hardware resources; configurations of 16, 32, 64, 128, and 256 CPEs can be utilized to accelerate a desired Convolutional Neural Network (CNN). To demonstrate the effectiveness of the proposed architecture, it is applied to the YOLOv3-Tiny object detection CNN. Experimental results show that our proposed architecture with 128 CPE cores can operate at 62.8 frames per second on an FPGA Xilinx XCKU060 with a working frequency of 200 MHz, using 16-bit fixed-point representation. This approach results in only a 1% drop in mAP while utilizing 43.2K LUTs, 94.4K FFs, 26.73 Mbits of RAM, and 1364 DSPs. Furthermore, the number of external memory chips is reduced by 67% compared to the state-of-the-art systems.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103468"},"PeriodicalIF":3.7,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144240327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel hybrid probabilistic–statistical error metrics for approximate adders","authors":"Vishesh Mishra , Sparsh Mittal , Urbi Chatterjee","doi":"10.1016/j.sysarc.2025.103467","DOIUrl":"10.1016/j.sysarc.2025.103467","url":null,"abstract":"<div><div>Approximate computing (AxC) has emerged as a promising approach for improving error-tolerant applications’ performance and energy efficiency. The approximate adder designs provide disproportionate energy and performance gains at the cost of a bounded loss in precision. The existing error metrics provide limited insights into error generation and propagation and correlate poorly with end-application quality-of-result (QoR). In this paper, we propose four novel error metrics for approximate adders that bring together the best of statistical and probabilistic approaches. These metrics are based on the probabilistic adder-dependent error-generation vector (<span><math><mrow><mover><mrow><mi>A</mi><mi>G</mi><mi>V</mi></mrow><mo>⃗</mo></mover><mo>,</mo></mrow></math></span>) and the input-dependent error-propagation vector (<span><math><mover><mrow><mi>I</mi><mi>P</mi><mi>V</mi></mrow><mo>⃗</mo></mover></math></span>). Our proposed metrics decouple error generation from propagation and model the impact of both adder characteristics and (application-dependent) input distribution. Extensive evaluation with 28 approximate adders over three real-world applications (Gaussian Smoothing, Support Vector Machine, and Neural Network evaluated on datasets such as MNIST, CIFAR-10, and ImageNet) shows that our metrics are more strongly correlated with application QoR than conventional metrics such as mean relative error distance (MRED), worst-case error (WCE) or error-rate (ER). Our metrics also help identify suitable adder designs for different applications. We will open-source our code.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103467"},"PeriodicalIF":3.7,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144221446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generative AI-based pipeline architecture for increasing training efficiency in intelligent weed control systems","authors":"Sourav Modak, Anthony Stein","doi":"10.1016/j.sysarc.2025.103464","DOIUrl":"10.1016/j.sysarc.2025.103464","url":null,"abstract":"<div><div>In automated crop protection tasks, deep learning has demonstrated significant potential. However, these advanced models rely heavily on high-quality, diverse datasets, which are often scarce and costly to obtain in agricultural settings. Traditional data augmentation techniques, while useful for increasing the volume of the dataset, often fail to capture the real-world variability needed for robust model training. In this paper, we present a novel method for generating synthetic images to enhance the training of deep learning-based object detection models for intelligent weed control, aiming to improve data efficiency. The architecture of our GenAI-based image generation pipeline integrates the Segment Anything Model (SAM) for zero-shot domain adaptation with a text-to-image Stable Diffusion Model, enabling the creation of synthetic images that can accurately reflect the idiosyncratic properties and appearances of a variety of real-world conditions. We further assess the application of these synthetic datasets on edge devices by evaluating state-of-the-art lightweight YOLO models, measuring data efficiency by comparing mAP50 and mAP50-95 scores among different proportions of real and synthetic training data. Incorporating these synthetic datasets into the training process has been found to result in notable improvements in terms of data efficiency. For instance, most YOLO models that are trained on a dataset consisting of 10% synthetic images and 90% real-world images typically demonstrate superior scores on mAP50 and mAP50-95 metrics compared to those trained solely on real-world images. The integration of this approach opens opportunities for achieving continual self-improvement of perception modules in intelligent technical systems.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103464"},"PeriodicalIF":3.7,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144221447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: FedMQ+: Towards efficient heterogeneous federated learning with multi-grained quantization
Authors: Mei Cao, Huiyu Wang, Yuan Yuan, Jianbo Lu, Xiaojun Cai, Dongxiao Yu, Mengying Zhao
Journal of Systems Architecture, Volume 167, Article 103460. DOI: 10.1016/j.sysarc.2025.103460. Published 2025-05-30.
Abstract: Federated Learning (FL) is a distributed machine learning paradigm that enables collaborative model training while preserving client data privacy. Although this approach significantly enhances data privacy, it introduces substantial communication overhead. Quantization techniques mitigate this challenge by compressing model parameters into fewer bits. However, traditional quantization methods are primarily applied at the client level, often overlooking the heterogeneous importance of distinct model parameters. In our previous work, FedMQ, we explored quantization mechanisms at both inter-client and intra-client levels to improve communication efficiency. Nevertheless, that approach compromises model accuracy because it aggregates models with limited expressive capability, and it lacks comprehensive theoretical analysis and mathematical verification. In this paper, we propose FedMQ+, an improved framework for heterogeneous federated learning with enhanced dequantization, to optimize global model performance. First, we design a precise dequantization strategy based on normal functions to accurately reconstruct full-precision weights from the given low-precision weights. Next, we conduct a rigorous theoretical analysis of FedMQ+, establish an upper bound for its convergence, and mathematically demonstrate its $O(1/T)$ convergence rate. Finally, we perform extensive experiments across diverse datasets and models. Experimental results demonstrate that FedMQ+ achieves significant improvements in convergence speed, ranging from 3.1% to 85.8%, while maintaining comparable model accuracy and achieving superior communication efficiency compared with state-of-the-art baselines.
Title: GNNBoost: Accelerating sampling-based GNN training on large-scale graphs by optimizing data preparation
Authors: Yujuan Tan, Yan Gan, Zhaoyang Zeng, Zhuoxin Bai, Lei Qiao, Duo Liu, Kan Zhong, Ao Ren
Journal of Systems Architecture, Volume 167, Article 103456. DOI: 10.1016/j.sysarc.2025.103456. Published 2025-05-30.
Abstract: Graph Neural Networks (GNNs) have successfully extended deep learning from traditional Euclidean spaces to complex graph structures. Sampling-based GNN training has been widely adopted for large-scale graphs without compromising accuracy. However, graph irregularity results in imbalanced sampling workloads, making it challenging for existing GNN systems to effectively utilize GPU resources for graph sampling. Additionally, in GNN systems where both topology and feature caches are enabled, differences in the characteristics and purposes of the cached data make it difficult to allocate GPU memory between these two caches with minimal overhead. To address these challenges, we propose GNNBoost, a framework designed to accelerate GNN training. GNNBoost consists of two key innovations. First, GNNBoost introduces a degree-oriented sampling schedule that groups training vertices based on their degrees and applies tailored sampling strategies to balance GPU workloads and improve sampling performance. Second, GNNBoost develops a low-overhead cache space allocation mechanism that accurately determines the optimal cache sizes for graph topology and features across different workloads, minimizing both space and time overheads. We conduct a comprehensive evaluation of GNNBoost with various GNN models and large graph datasets, demonstrating that it significantly outperforms existing GNN training systems.
Title: PFV2: Packet fragmentation with variable size and vigorous mapping in time-sensitive networking
Authors: Wenyan Yan, Bin Fu, Dongsheng Wei, Renfa Li, Yixue Lei, Yuhang Jia, Guoqi Xie
Journal of Systems Architecture, Volume 166, Article 103457. DOI: 10.1016/j.sysarc.2025.103457. Published 2025-05-29.
Abstract: With the rapid advancement of intelligent automobiles, the ever-growing volume of communication data imposes high-bandwidth and low-latency requirements. To meet these requirements, automotive Original Equipment Manufacturers (OEMs) widely adopt domain-centralized Electrical/Electronic (E/E) architectures. In this architecture, Time-Sensitive Networking (TSN) is expected to serve as the backbone network because of its high bandwidth and deterministic communication. TSN uses the Gate Control List (GCL) to divide the scheduling period into multiple time slots, whose durations are not necessarily uniform (they may be equal or unequal), and different flows have varying requirements for time slot sizes. To enhance the acceptance ratio of Time-Triggered (TT) flows through GCL time slot allocation, packet fragmentation (i.e., flow fragmentation) has recently been introduced into TSN. The state-of-the-art packet fragmentation solution divides one un-schedulable TT flow into multiple equal-sized packets (i.e., equal-sized time slots); such fixed-size packets are difficult to map onto time slots of different sizes.
This study develops a Packet Fragmentation with Variable-size and Vigorous-mapping (PFV2) technique based on three innovations: (1) we implement variable-size packet fragmentation, which iteratively divides the un-schedulable TT flow into smaller packets and then dynamically reschedules these packets; (2) we implement a vigorous mapping solution from packets to time slots by deeply searching for available time slots within the flow's deadline; and (3) we verify PFV2 on the LS1028A with Cortex-A72 (i.e., the NXP automotive-grade development board). PFV2 improves the acceptance ratio by up to 20.18% and bandwidth utilization by up to 7.024% compared with the state-of-the-art solution. The theoretical and practical co-verification experiments demonstrate that PFV2 can effectively improve the flow acceptance ratio and outperform the state-of-the-art solution.
Title: Efficient registration-based signature schemes
Authors: Zhenhua Liu, Yujing Yuan, Zhiqing Chen, Yu Han, Baocang Wang
Journal of Systems Architecture, Volume 166, Article 103458. DOI: 10.1016/j.sysarc.2025.103458. Published 2025-05-29.
Abstract: Registration-based cryptography is a newly emerging variant of public-key cryptography that addresses the key-escrow issue inherent in identity-based encryption. However, the key-escrow problem in identity-based signatures remains to be solved. In this work, we present a basic registration-based signature (RBS) scheme, which removes the trusted third party that generates secret keys in identity-based signatures. In the proposed basic RBS scheme, a trie is used to expand the identity space to support a large space of users' identities (e.g., arbitrary strings), and a vector commitment is utilized to bind each public key to the user's identity and to accumulate the registered public keys that enable verification. The basic scheme is proven EUF-CMA secure under the CDH assumption in the random oracle model. Moreover, we provide two improvements over the basic scheme: an adaptive RBS scheme and an updatable RBS scheme. Finally, a simulated implementation shows that the proposed basic RBS scheme is practical and viable at a scale encompassing millions of users.
{"title":"BQProfit: Budget and QoS aware task offloading and resource allocation for Profit maximization under MEC platform","authors":"Akhirul Islam, Manojit Ghose","doi":"10.1016/j.sysarc.2025.103447","DOIUrl":"10.1016/j.sysarc.2025.103447","url":null,"abstract":"<div><div>In the era of Multi-access Edge Computing (MEC), efficient partitioning of applications and, thereafter, allocation of edge resources to the tasks of the applications is crucial for maximizing the profit of the service provider. Although a few studies have been in recent times, they overlooked many essential parameters such as the dependency on tasks, deadline and demand-based dynamic charging plan, cooperation among edge servers, etc. This paper proposes a novel strategy named <em>BQProfit</em> that has two essential components: a Modified Kernighan–Lin (MKL) task offloading approach and a metaheuristic-based genetic algorithm known as Cost Optimized Resource Allocation (CORA). To evaluate the effectiveness of the proposed strategy, we perform an extensive simulation using synthetic and scientific workflow data sets and compare the result with a baseline policy and two state-of-the-art policies. On average, BQProfit increases the service provider’s profit by 4.1x compared to the benchmark policies and by 1.5x compared to the best-performing benchmark policy. Our strategy also outperforms these policies by reducing task failure rates by 34.06% on average, and 25% compared to the best-performing benchmark policy. Additionally, BQProfit shows an average improvement of 70.83% over the state-of-the-art and 44% over the top-performing benchmark policy.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"166 ","pages":"Article 103447"},"PeriodicalIF":3.7,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144170017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: An FPGA-based bit-level weight sparsity and mixed-bit accelerator for neural networks
Authors: Xianghong Hu, Shansen Fu, Yuanmiao Lin, Xueming Li, Chaoming Yang, Rongfeng Li, Hongmin Huang, Shuting Cai, Xiaoming Xiong
Journal of Systems Architecture, Volume 166, Article 103463. DOI: 10.1016/j.sysarc.2025.103463. Published 2025-05-28.
Abstract: Bit-level weight sparsity and mixed-bit quantization are regarded as effective methods for improving the computing efficiency of convolutional neural network (CNN) accelerators. However, irregular sparse matrices greatly increase index overhead and hardware resource consumption. Moreover, bit-serial computing (BSC) is usually adopted to implement bit-level weight sparsity on accelerators, and traditional BSC leads to uneven utilization of DSP and LUT resources on FPGA platforms, thereby limiting overall accelerator performance. Therefore, in this work, we present an accelerator designed for bit-level weight sparsity and mixed-bit quantization. We first introduce a non-linear quantization algorithm named bit-level sparsity learned quantizer (BSLQ), which maintains high accuracy during mixed quantization and guides the accelerator to complete bit-level weight-sparse computations using DSPs. Based on this algorithm, we implement a multi-channel bit-level sparsity (MCBS) method to mitigate irregularity and reduce the index count associated with bit-level sparsity. Finally, we propose a sparse weight arbitrary basis scratch pad (SWAB SPad) method that enables retrieval of compressed weights without fetching activations, saving 30.52% of LUTs and 64.02% of FFs. Experimental results demonstrate that when quantizing ResNet50 and VGG16 using 4/8 bits, our approach achieves accuracy comparable to or even better than 32-bit (75.98% and 73.70% for the two models). Compared with state-of-the-art FPGA-based accelerators, this accelerator achieves up to 5.36x higher DSP efficiency and 8.87x higher energy efficiency.