{"title":"A Complexity-Effective Local Delta Prefetcher","authors":"Agustín Navarro-Torres;Biswabandan Panda;Jesús Alastruey-Benedé;Pablo Ibáñez;Víctor Viñals-Yúfera;Alberto Ros","doi":"10.1109/TC.2025.3533086","DOIUrl":"https://doi.org/10.1109/TC.2025.3533086","url":null,"abstract":"Data prefetching is crucial for performance in modern processors, as it effectively masks long-latency memory accesses. Over the past decades, numerous data prefetching mechanisms have been proposed, continuously reducing the access latency to the memory hierarchy. Several state-of-the-art prefetchers, namely the Instruction Pointer Classifier Prefetcher (IPCP) and Berti, target the first-level data cache, and thus they are able to completely hide the miss latency for timely prefetched cache lines. Berti exploits timely local deltas to achieve high accuracy and performance. This paper extends Berti with a larger evaluation and extra optimizations on top of the previous conference paper. The result is a complexity-effective version of Berti that outperforms it for a large number of workloads and simplifies its control logic. The key to these advancements is a simple mechanism for learning timely deltas without the need to track the fetch latency of each cache miss. 
Our experiments, conducted with a wide range of workloads (CVP traces by Qualcomm, SPEC CPU2017, and GAP), show performance improvements of 4.0% over a mainstream stride prefetcher, and of a non-negligible 1.4% over the previously published version of Berti, while requiring similar storage.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1482-1494"},"PeriodicalIF":3.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RGKV: A GPGPU-Empowered Compaction Framework for LSM-Tree-Based KV Stores With Optimized Data Transfer and Parallel Processing","authors":"Hui Sun;Xiangxiang Jiang;Yinliang Yue;Xiao Qin","doi":"10.1109/TC.2025.3535832","DOIUrl":"https://doi.org/10.1109/TC.2025.3535832","url":null,"abstract":"The Log-structured merge-tree (LSM-tree), widely adopted in key-value stores (KV stores), is esteemed for its efficient write performance and superb scalability amid large-scale data processing. The compaction process of LSM-trees consumes significant computational resources, thereby becoming a bottleneck for system performance. Traditionally, compaction is handled by CPUs, but CPU processing capacity often falls short of increasing demands as data volumes surge. To address this challenge, existing solutions attempt to accelerate compaction using GPGPUs. Due to low GPGPU parallelism and data transfer delays in prior studies, the anticipated performance improvements have not yet been fully realized. In this paper, we bring forth RGKV, a comprehensive optimization approach that overcomes the limitations of current GPGPU-empowered KV stores. RGKV features GPGPU-adapted contiguous memory allocation and a GPGPU-optimized key-value block architecture to furnish highly efficient GPGPU parallel encoding and decoding catering to the needs of KV stores. To enhance the computational efficiency and overall performance of KV stores, RGKV employs a parallel merge-sorting algorithm to maximize the parallel processing capabilities of the GPGPU. Moreover, RGKV incorporates a data transfer module anchored on GPUDirect Storage technology, designed for KV stores, and an efficient data structure to substantially curtail data transfer latency between an SSD and a GPGPU, boosting data transfer speed and alleviating CPU load. 
The experimental results demonstrate that RGKV achieves a remarkable 4<inline-formula><tex-math>$\times$</tex-math></inline-formula> improvement in overall throughput and a 7<inline-formula><tex-math>$\times$</tex-math></inline-formula> improvement in compaction throughput compared to state-of-the-art KV stores, while also reducing average write latency by 70.6%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1605-1619"},"PeriodicalIF":3.6,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Delay-Aware Joint Microservice Deployment and Request Routing With DVFS in Edge: A Reinforcement Learning Approach","authors":"Liangyuan Wang;Xudong Liu;Haonan Ding;Yi Hu;Kai Peng;Menglan Hu","doi":"10.1109/TC.2025.3535826","DOIUrl":"https://doi.org/10.1109/TC.2025.3535826","url":null,"abstract":"The emerging microservice architecture offers opportunities for accommodating delay-sensitive applications at the edge. However, such applications are computation-intensive and energy-consuming, imposing great difficulties on edge servers with limited computing resources, energy supply, and cooling capabilities. To reduce delay and energy consumption at the edge, efficient microservice orchestration is necessary but significantly challenging. Due to frequent communication among multiple microservices, service deployment and request routing are tightly coupled, which motivates a complex joint optimization problem. When considering multi-instance modeling and fine-grained orchestration for massive numbers of microservices, the difficulty is greatly amplified. Nevertheless, previous work failed to address the above difficulties and neglected to balance delay and energy, especially lacking dynamic energy-saving capabilities. Therefore, this paper minimizes energy and delay by jointly optimizing microservice deployment and request routing via multi-instance modeling, fine-grained orchestration, and dynamic adaptation. Our queuing network model enables accurate end-to-end time analysis covering queuing, computing, and communication delays. We then propose a delay-aware reinforcement learning algorithm, which derives the static service deployment and routing decisions. Moreover, we design an energy-aware dynamic frequency scaling algorithm, which saves energy under fluctuating request patterns. 
Experimental results demonstrate that our approaches significantly outperform baseline algorithms in both delay and energy consumption.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1589-1604"},"PeriodicalIF":3.6,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Small Hazard-Free Transducers","authors":"Johannes Bund;Christoph Lenzen;Moti Medina","doi":"10.1109/TC.2025.3533096","DOIUrl":"https://doi.org/10.1109/TC.2025.3533096","url":null,"abstract":"In digital circuits, hazardous input signals are a result of spurious operation of bistable elements. For example, the problem occurs in circuits with asynchronous inputs or clock domain crossings. Marino (TC’81) showed that hazards in bistable elements are inevitable. Hazard-free circuits compute the “most stable” output possible on hazardous inputs, under the constraint that they return the same output as the original circuit on stable inputs. Ikenmeyer et al. (JACM’19) proved an unconditional exponential separation between the hazard-free complexity and (standard) circuit complexity of explicit functions. Despite that, asymptotically optimal hazard-free sorting circuits are possible (Bund et al., TC’19). This raises the question: Which classes of functions permit efficient hazard-free circuits? We prove that circuit implementations of transducers with small state spaces are such a class. A transducer is a finite state machine that transcribes, symbol by symbol, an input string of length n into an output string of length n. We present a construction that transforms any function arising from a transducer into an efficient circuit that computes the hazard-free extension of the function. 
For transducers with constant state space, the circuit has asymptotically optimal size, with small constants if the state space is small.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1549-1564"},"PeriodicalIF":3.6,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing Tasks Saving Schemes Through Early Exit in Edge Intelligence-Assisted Systems","authors":"Xin Niu;Xianwei Lv;Wang Chen;Chen Yu;Hai Jin","doi":"10.1109/TC.2025.3533098","DOIUrl":"https://doi.org/10.1109/TC.2025.3533098","url":null,"abstract":"Edge intelligence (EI) is a promising paradigm in which end devices collaborate with edge servers to provide artificial intelligence services to users. In most realistic scenarios, end devices often move unpredictably, resulting in frequent computing migrations. Moreover, a surge in computing tasks offloaded to edge servers significantly prolongs queuing latency. These two issues obstruct the timely completion of computing tasks in EI-assisted systems. In this paper, we formulate an optimization problem aiming to maximize computing task completion under latency constraints. To address this problem, we first categorize computing tasks into new computing tasks (NCTs) and partially completed computing tasks (PCTs). Subsequently, based on model partitioning, we design a new computing task saving scheme (NSS) to optimize early exit points for NCTs and computing tasks waiting in the queue. Furthermore, we propose a partially completed computing task saving scheme (PSS) to set early exit points for PCTs during computing migrations. 
Extensive experiments show that the proposed task saving schemes achieve a computing task completion rate of at least 90% and a latency reduction of up to 61.81% compared to other methods.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1565-1576"},"PeriodicalIF":3.6,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10854688","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Structured-Sparse Matrix Multiplication in RISC-V Vector Processors","authors":"Vasileios Titopoulos;Kosmas Alexandridis;Christodoulos Peltekis;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos","doi":"10.1109/TC.2025.3533083","DOIUrl":"https://doi.org/10.1109/TC.2025.3533083","url":null,"abstract":"Structured sparsity has been proposed as an efficient way to prune the complexity of Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. Accelerating ML models, whether for training or inference, heavily relies on matrix multiplications that can be efficiently executed on vector processors or custom matrix engines. This work aims to integrate the simplicity of structured sparsity into vector execution to speed up the corresponding matrix multiplications. Initially, the implementation of structured-sparse matrix multiplication using the current RISC-V instruction set vector extension is comprehensively explored. Critical parameters that affect performance, such as the impact of data distribution across the scalar and vector register files, data locality, and the effectiveness of loop unrolling, are analyzed both qualitatively and quantitatively. Furthermore, it is demonstrated that the addition of a single new instruction would yield even higher performance. The newly proposed instruction is called <monospace>vindexmac</monospace>, i.e., vector index-multiply-accumulate. It allows for indirect reads from the vector register file and reduces the number of instructions executed per matrix multiplication iteration, without introducing additional dependencies that would limit loop unrolling. The proposed new instruction was integrated into a decoupled RISC-V vector processor with negligible hardware cost. Experimental results demonstrate the runtime efficiency and the scalability offered by the introduced optimizations and the new instruction for the execution of state-of-the-art Convolutional Neural Networks. 
In particular, the addition of the custom instruction improves runtime by 25% and 33% when compared with highly optimized vectorized kernels that use only the currently defined RISC-V instructions.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 4","pages":"1446-1460"},"PeriodicalIF":3.6,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143611838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Graph Structure of Baker's Maps Implemented on a Computer","authors":"Chengqing Li;Kai Tan","doi":"10.1109/TC.2025.3533094","DOIUrl":"https://doi.org/10.1109/TC.2025.3533094","url":null,"abstract":"The complex dynamics of the baker's map and its variants in infinite-precision mathematical domains and quantum settings have been extensively studied over the past five decades. However, their behavior in finite-precision digital computing remains largely unknown. This paper addresses this gap by investigating the graph structure of the generalized two-dimensional baker's map and its higher-dimensional extension, referred to as HDBM, as implemented in the discrete setting of a digital computer. We provide a rigorous analysis of how the map parameters shape the in-degree bounds and distribution within the functional graph, revealing that fractal-like structures intensify as the parameters approach each other and as arithmetic precision increases. Furthermore, we demonstrate that recursive tree structures can characterize the functional graph structure of HDBM in a fixed-point arithmetic domain. Similar to the 2-D case, the degree of any non-leaf node in the functional graph, when implemented in the floating-point arithmetic domain, is determined solely by its last component. We also reveal the relationship between the functional graphs of HDBM across the two arithmetic domains. 
These findings lay the groundwork for dynamic analysis, effective control, and broader application of the baker's map and its variants in diverse domains.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1524-1537"},"PeriodicalIF":3.6,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pruning-Based Adaptive Federated Learning at the Edge","authors":"Dongxiao Yu;Yuan Yuan;Yifei Zou;Xiao Zhang;Yu Liu;Lizhen Cui;Xiuzhen Cheng","doi":"10.1109/TC.2025.3533095","DOIUrl":"https://doi.org/10.1109/TC.2025.3533095","url":null,"abstract":"Federated Learning (FL) is a new learning framework in which <inline-formula><tex-math>$s$</tex-math></inline-formula> clients collaboratively train a model under the guidance of a central server. Meanwhile, with the advent of the era of large models, model parameters are growing explosively. Therefore, it is important to design federated learning algorithms for the edge environment. However, the edge environment is severely limited in computing, storage, and network bandwidth resources. Concurrently, adaptive gradient methods show better performance than constant learning rates in non-distributed settings. In this paper, we propose a pruning-based distributed Adam (PD-Adam) algorithm, which combines model pruning and adaptive learning steps to achieve an asymptotically optimal convergence rate of <inline-formula><tex-math>$O(1/\sqrt[4]{K})$</tex-math></inline-formula>. At the same time, the algorithm achieves convergence consistent with the centralized model. Finally, extensive experiments confirm the convergence of our algorithm, demonstrating its reliability and effectiveness across various scenarios. 
Specifically, our proposed algorithm is <inline-formula><tex-math>$2$</tex-math></inline-formula>% and <inline-formula><tex-math>$18$</tex-math></inline-formula>% more accurate than the current state-of-the-art FedAvg algorithm with ResNet on the CIFAR datasets.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1538-1548"},"PeriodicalIF":3.6,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Slack Time Management for Imprecise Mixed-Criticality Systems With Reliability Constraints","authors":"Yi-Wen Zhang;Hui Zheng","doi":"10.1109/TC.2025.3533100","DOIUrl":"https://doi.org/10.1109/TC.2025.3533100","url":null,"abstract":"A Mixed-Criticality System (MCS) integrates multiple applications with different criticality levels on the same hardware platform. For power and energy-constrained systems such as Unmanned Aerial Vehicles, it is important to minimize energy consumption of the computing system while meeting reliability constraints. In this paper, we first determine the number of tolerated faults according to the given reliability target. Second, we propose a schedulability test for MCS with semi-clairvoyance and checkpointing. Third, we propose the Energy-Aware Scheduling with Reliability Constraint (EASRC) scheduling algorithm for MCS with semi-clairvoyance and checkpointing. It consists of an offline phase and an online phase. In the offline phase, we determine the offline processor speed by reclaiming static slack time. In the online phase, we adjust the processor speed by reclaiming dynamic slack time to further save energy. Finally, we show the performance of our proposed algorithm through experimental evaluations. The results show that the proposed algorithm can save an average of 9.67% of energy consumption compared with existing methods.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1577-1588"},"PeriodicalIF":3.6,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DC-ORAM: An ORAM Scheme Based on Dynamic Compression of Data Blocks and Position Map","authors":"Chuang Li;Changyao Tan;Gang Liu;Yanhua Wen;Yan Wang;Kenli Li","doi":"10.1109/TC.2025.3533089","DOIUrl":"https://doi.org/10.1109/TC.2025.3533089","url":null,"abstract":"Oblivious RAM (ORAM) is an efficient cryptographic primitive that prevents leakage of memory access patterns. It has been adopted in modern secure processors and plays an important role in memory security protection. Although the most advanced ORAM designs have made great progress in performance optimization, the data block access overhead and the on-chip PosMap storage overhead are still too high, which leads to problems such as low system performance. To overcome these challenges, in this paper we propose the DC-ORAM system, which reduces the data access overhead and the on-chip PosMap storage overhead by using dynamic compression techniques. Specifically, we use byte-stream redundancy compression to compress data blocks in the ORAM tree. In the PosMap, a high-bit multiplexing strategy is used to compress leaf labels (or path labels) whose high-order bits repeat. By introducing the above compression techniques, compared with conventional Path ORAM, the compression rate of the ORAM tree is <inline-formula><tex-math>$52.9\%$</tex-math></inline-formula> and that of the PosMap is <inline-formula><tex-math>$40.0\%$</tex-math></inline-formula>. In terms of performance, compared to conventional Path ORAM, our proposed DC-ORAM system reduces the average latency by <inline-formula><tex-math>$33.6\%$</tex-math></inline-formula>. In addition, we apply the compression techniques proposed in this work to the Ring ORAM system. 
The comparison shows that, with the same compression ratio as Path ORAM, our design still reduces latency by an average of <inline-formula><tex-math>$21.5\%$</tex-math></inline-formula>.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 5","pages":"1495-1509"},"PeriodicalIF":3.6,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}