{"title":"Computing Tasks Saving Schemes Through Early Exit in Edge Intelligence-Assisted Systems","authors":"Xin Niu;Xianwei Lv;Wang Chen;Chen Yu;Hai Jin","doi":"10.1109/TC.2025.3533098","DOIUrl":"https://doi.org/10.1109/TC.2025.3533098","url":null,"abstract":"Edge intelligence (EI) is a promising paradigm where end devices collaborate with edge servers to provide artificial intelligence services to users. In most realistic scenarios, end devices often move unconsciously, resulting in frequent computing migrations. Moreover, a surge in computing tasks offloaded to edge servers significantly prolongs queuing latency. These two issues obstruct the timely completion of computing tasks in EI-assisted systems. In this paper, we formulate an optimization problem aiming to maximize computing task completion under latency constraints. To address this issue, we first categorize computing tasks into new computing tasks (NCTs) and partially completed computing tasks (PCTs). Subsequently, based on model partitioning, we design a new computing task saving scheme (NSS) to optimize early exit points for NCTs and computing tasks in the queuing queue. Furthermore, we propose a partially completed computing task saving scheme (PSS) to set early exit points for PCTs during computing migrations. Numerous experiments show that computing saving schemes can achieve at least 90% computing task completion rate and up to 61.81% latency reduction compared to other methods.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1565-1576"},"PeriodicalIF":3.6,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10854688","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Structured-Sparse Matrix Multiplication in RISC-V Vector Processors","authors":"Vasileios Titopoulos;Kosmas Alexandridis;Christodoulos Peltekis;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos","doi":"10.1109/TC.2025.3533083","DOIUrl":"https://doi.org/10.1109/TC.2025.3533083","url":null,"abstract":"Structured sparsity has been proposed as an efficient way to prune the complexity of Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. Accelerating ML models, whether for training, or inference, heavily relies on matrix multiplications that can be efficiently executed on vector processors, or custom matrix engines. This work aims to integrate the simplicity of structured sparsity into vector execution to speed up the corresponding matrix multiplications. Initially, the implementation of structured-sparse matrix multiplication using the current RISC-V instruction set vector extension is comprehensively explored. Critical parameters that affect performance, such as the impact of data distribution across the scalar and vector register files, data locality, and the effectiveness of loop unrolling are analyzed both qualitatively and quantitatively. Furthermore, it is demonstrated that the addition of a single new instruction would reap even higher performance. The newly proposed instruction is called <monospace>vindexmac</monospace>, i.e., vector index-multiply-accumulate. It allows for indirect reads from the vector register file and it reduces the number of instructions executed per matrix multiplication iteration, without introducing additional dependencies that would limit loop unrolling. The proposed new instruction was integrated in a decoupled RISC-V vector processor with negligible hardware cost. Experimental results demonstrate the runtime efficiency and the scalability offered by the introduced optimizations and the new instruction for the execution of state-of-the-art Convolutional Neural Networks. More particularly, the addition of a custom instruction improves runtime by 25% and 33%, when compared with highly-optimized vectorized kernels that use only the currently defined RISC-V instructions.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 4","pages":"1446-1460"},"PeriodicalIF":3.6,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143611838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Graph Structure of Baker's Maps Implemented on a Computer","authors":"Chengqing Li;Kai Tan","doi":"10.1109/TC.2025.3533094","DOIUrl":"https://doi.org/10.1109/TC.2025.3533094","url":null,"abstract":"The complex dynamics of the baker's map and its variants in infinite-precision mathematical domains and quantum settings have been extensively studied over the past five decades. However, their behavior in finite-precision digital computing remains largely unknown. This paper addresses this gap by investigating the graph structure of the generalized two-dimensional baker's map and its higher-dimensional extension, referred to as HDBM, as implemented on the discrete setting in a digital computer. We provide a rigorous analysis of how the map parameters shape the in-degree bounds and distribution within the functional graph, revealing fractal-like structures intensify as parameters approach each other and arithmetic precision increases. Furthermore, we demonstrate that recursive tree structures can characterize the functional graph structure of HDBM in a fixed-point arithmetic domain. Similar to the 2-D case, the degree of any non-leaf node in the functional graph, when implemented in the floating-point arithmetic domain, is determined solely by its last component. We also reveal the relationship between the functional graphs of HDBM across the two arithmetic domains. These findings lay the groundwork for dynamic analysis, effective control, and broader application of the baker's map and its variants in diverse domains.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1524-1537"},"PeriodicalIF":3.6,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pruning-Based Adaptive Federated Learning at the Edge","authors":"Dongxiao Yu;Yuan Yuan;Yifei Zou;Xiao Zhang;Yu Liu;Lizhen Cui;Xiuzhen Cheng","doi":"10.1109/TC.2025.3533095","DOIUrl":"https://doi.org/10.1109/TC.2025.3533095","url":null,"abstract":"Federated Learning (FL) is a new learning framework in which <inline-formula><tex-math>$s$</tex-math></inline-formula> clients collaboratively train a model under the guidance of a central server. Meanwhile, with the advent of the era of large models, the parameters of models are facing explosive growth. Therefore, it is important to design federated learning algorithms for edge environment. However, the edge environment is severely limited in computing, storage, and network bandwidth resources. Concurrently, adaptive gradient methods show better performance than constant learning rate in non-distributed settings. In this paper, we propose a pruning-based distributed Adam (PD-Adam) algorithm, which combines model pruning and adaptive learning steps to achieve asymptotically optimal convergence rate of <inline-formula><tex-math>$O(1/sqrt[4]{K})$</tex-math></inline-formula>. At the same time, the algorithm can achieve convergence consistent with the centralized model. Finally, extensive experiments have confirmed the convergence of our algorithm, demonstrating its reliability and effectiveness across various scenarios. Specially, our proposed algorithm is <inline-formula><tex-math>$2$</tex-math></inline-formula>% and <inline-formula><tex-math>$18$</tex-math></inline-formula>% more accurate than the current state-of-the-art FedAvg algorithm on the ResNet and CIFAR datasets.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1538-1548"},"PeriodicalIF":3.6,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Slack Time Management for Imprecise Mixed-Criticality Systems With Reliability Constraints","authors":"Yi-Wen Zhang;Hui Zheng","doi":"10.1109/TC.2025.3533100","DOIUrl":"https://doi.org/10.1109/TC.2025.3533100","url":null,"abstract":"A Mixed-Criticality System (MCS) integrates multiple applications with different criticality levels on the same hardware platform. For power and energy-constrained systems such as Unmanned Aerial Vehicles, it is important to minimize energy consumption of the computing system while meeting reliability constraints. In this paper, we first determine the number of tolerated faults according to the given reliability target. Second, we propose a schedulability test for MCS with semi-clairvoyance and checkpointing. Third, we propose the Energy-Aware Scheduling with Reliability Constraint (EASRC) scheduling algorithm for MCS with semi-clairvoyance and checkpointing. It consists of an offline phase and an online phase. In the offline phase, we determine the offline processor speed by reclaiming static slack time. In the online phase, we adjust the processor speed by reclaiming dynamic slack time to further save energy. Finally, we show the performance of our proposed algorithm through experimental evaluations. The results show that the proposed algorithm can save an average of 9.67% of energy consumption compared with existing methods.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1577-1588"},"PeriodicalIF":3.6,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DC-ORAM: An ORAM Scheme Based on Dynamic Compression of Data Blocks and Position Map","authors":"Chuang Li;Changyao Tan;Gang Liu;Yanhua Wen;Yan Wang;Kenli Li","doi":"10.1109/TC.2025.3533089","DOIUrl":"https://doi.org/10.1109/TC.2025.3533089","url":null,"abstract":"Oblivious RAM (ORAM) is an efficient cryptographic primitive that prevents leakage of memory access patterns. It has been referenced by modern secure processors and plays an important role in memory security protection. Although the most advanced ORAM has made great progress in performance optimization, the access overhead (i.e., data blocks) and on-chip (i.e., PosMap) storage overhead is still too high, which will lead to problems such as low system performance. To overcome the above challenges, in this paper, we propose a DC-ORAM system, which reduces the data access overhead and on-chip PosMap storage overhead by using dynamic compression technology. Specifically, we use byte stream redundancy compression technology to compress data blocks on the ORAM tree. And in PosMap, a high-bit multiplexing strategy is used to achieve data compression for binary high-bit repeated data of leaf labels (or path labels). By introducing the above compression technology, in this work, compared with conventional Path ORAM, the compression rate of the ORAM tree is <inline-formula><tex-math>$52.9%$</tex-math></inline-formula>, and the compression rate of PosMap is <inline-formula><tex-math>$40.0%$</tex-math></inline-formula>. In terms of performance, compared to conventional Path ORAM, our proposed DC-ORAM system reduces the average latency by <inline-formula><tex-math>$33.6%$</tex-math></inline-formula>. In addition, we apply the compression technology proposed in this work to the Ring ORAM system. By comparison, it is found that with the same compression ratio as Path ORAM, our design can still reduce latency by an average of <inline-formula><tex-math>$21.5%$</tex-math></inline-formula>.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1495-1509"},"PeriodicalIF":3.6,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Caching Dependency-Aware Task Offloading in Mobile Edge Computing","authors":"Liang Zhao;Zijia Zhao;Ammar Hawbani;Zhi Liu;Zhiyuan Tan;Keping Yu","doi":"10.1109/TC.2025.3533091","DOIUrl":"https://doi.org/10.1109/TC.2025.3533091","url":null,"abstract":"Mobile Edge Computing (MEC) is a distributed computing paradigm that provides computing capabilities at the periphery of mobile cellular networks. This architecture empowers Mobile Users (MUs) to offload computation-intensive applications to large-scale computing nodes near the edge side, reducing application latency for MUs. The resource allocation and task offloading in MEC has been widely studied. However, the burgeoning complexity inherent to modern applications, often represented as Directed Acyclic Graphs (DAGs) comprising a multitude of subtasks with interdependencies, poses huge challenges for application offloading and resource allocation. Meanwhile, previous work has neglected the impact of edge caching on the offloading execution of dependent tasks. Therefore, this paper introduces a novel dynamic <underline>cach</u>ing dependency-aware task <underline>of</u>floading (CachOf) scheme. First, to effectively enhance the rationality of cache and computing resource allocation, we develop a subtask priority computation scheme based on DAG dependencies. This scheme includes the execution sequence priority of subtasks on a single MU and the offloading sequence priority of subtasks from multiple MUs. Second, a dynamic caching scheme, designed to cater to dependent tasks, is proposed. This caching approach can not only assist offloading decisions, but also contribute to load balancing by harmonizing caching resources among edge servers. Finally, based on the task prioritization results and caching results, this paper presents a Deep Reinforcement Learning (DRL)-based offloading scheme to judiciously allocate resources and improve the execution efficiency of applications. Extensive simulation experiments demonstrate that CachOf outperforms other baseline schemes, achieving improved execution efficiency for applications.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1510-1523"},"PeriodicalIF":3.6,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Trojan Detection Methods for Gate-Level Netlists Based on Graph Neural Networks","authors":"Peijun Ma;Jie Li;Hongjin Liu;Jiangyi Shi;Shaolin Zhang;Weitao Pan;Yue Hao","doi":"10.1109/TC.2025.3533085","DOIUrl":"https://doi.org/10.1109/TC.2025.3533085","url":null,"abstract":"Currently, untrusted third-party entities are increasingly involved in various stages of IC design and manufacturing, posing a significant threat to the reliability and security of SoCs due to the presence of hardware Trojans (HTs). In this paper, gate-level HT detection methods based on graph neural networks (GNNs) are established to overcome the defects of existing machine learning, which makes it difficult to characterize circuit connection relationships. We introduce harmonic centrality in the feature engineering of gate-level HT detection, which reflects the positional information of nodes and their adjacent nodes in the graph, thereby enhancing the quality of feature engineering. We use the golden section weight optimization algorithm to configure penalty weights to alleviate the problem of extreme data imbalance. In the SAED database, GraphSAGE-LSTM model obtained a TPR of 88.06% and an average F1 score of 90.95%. In the combined HT netlist of LEDA datasets, GraphSAGE-POOL model obtains a TPR of 88.50% and the best F1 score of 92.17%. In sequential HT netlist, GraphSAGE-LSTM model performs optimally, with a TPR of 98.25% and an average F1 score of 98.59%. Compared to existing detection models, the F1 score is enhanced by 8.86% and 2.48% on combined and sequential HT datasets, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1470-1481"},"PeriodicalIF":3.6,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SLOpt: Serving Real-Time Inference Pipeline With Strict Latency Constraint","authors":"Zhixin Zhao;Yitao Hu;Guotao Yang;Ziqi Gong;Chen Shen;Laiping Zhao;Wenxin Li;Xiulong Liu;Wenyu Qu","doi":"10.1109/TC.2025.3528125","DOIUrl":"https://doi.org/10.1109/TC.2025.3528125","url":null,"abstract":"The rise of machine learning as a service (MLaaS) has driven the demand for complex and customized real-time inference tasks, often requiring cascading multiple deep neural network (DNN) models into inference pipelines. However, these pipelines pose significant challenges due to scheduling complexity, particularly in maintaining strict latency service level objectives (SLOs). Existing systems serve pipelines with model-independent scheduling policies, which ignore the unique workload characteristics introduced by model cascading in the inference pipeline, leading to SLO violations and resource inefficiencies. In this paper, we propose that the serving system should exploit the model-cascading nature and intermodel workload dependency of the inference pipeline to ensure strict latency SLO cost-effectively. Based on this, we design and implement <monospace>SLOpt</monospace>, a serving system optimized for real-time inference pipelines with a three-stage codesign of workload estimation, resource provisioning, and request execution. <monospace>SLOpt</monospace> proposes cascade workload estimation and ahead-of-time tuning, which together address the challenge of cascade blocking and head-of-line blocking in workload estimation and resource provisioning. <monospace>SLOpt</monospace> further implements an adaptive batch drop policy to mitigate latency amplification issues within the pipeline. These innovations enable <monospace>SLOpt</monospace> to reduce the 99th percentile latency (P99 latency) by <inline-formula><tex-math>$1.4$</tex-math></inline-formula> to <inline-formula><tex-math>$2.5$</tex-math></inline-formula> times compared to the state of the arts while lowering serving costs by up to <inline-formula><tex-math>$29%$</tex-math></inline-formula>. Moreover, to achieve comparable P99 latency, <monospace>SLOpt</monospace> requires up to <inline-formula><tex-math>$70%$</tex-math></inline-formula> less cost than existing systems. Extensive evaluations on a 64-GPU cluster demonstrate <monospace>SLOpt</monospace>'s effectiveness in meeting strict P99 latency SLOs under diverse real-world workloads.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 4","pages":"1431-1445"},"PeriodicalIF":3.6,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143611853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}