{"title":"ZeroTracer: In-Band eBPF-Based Trace Generator With Zero Instrumentation for Microservice Systems","authors":"Wanqi Yang;Pengfei Chen;Kai Liu;Huxing Zhang","doi":"10.1109/TPDS.2025.3571934","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3571934","url":null,"abstract":"Microservice enables agility in modern cloud-native applications but introduces challenges in fault troubleshooting due to its complex service coordination and cooperation. To tackle these challenges, distributed tracing has emerged for end-to-end request tracing and system understanding. However, existing tracing solutions often suffer from code instrumentation, trace loss and inaccuracy. To overcome these limitations, we introduce ZeroTracer, an in-kernel online distributed tracing system equipped with an eBPF-based (extended Berkeley Packet Filter) trace generator. ZeroTracer tailors for tracking HTTP requests due to its popularity in microservice systems. In our evaluations, ZeroTracer achieves remarkable trace accuracy (i.e., over 91% ) and maintains stable performance under different workload concurrency. Moreover, ZeroTracer outperforms other non-invasive approaches which fail to reconcile accurate request causality. Notably, ZeroTracer effectively tracks end-to-end requests in multi-threaded microservice applications, which is absent in existing invasive distributed tracing systems with third-party library instrumentation. Moreover, ZeroTracer introduces a negligible overhead, with latency increasing by only 0.5% –1.2% and a modest 3% –5.8% increase in CPU and memory consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1478-1494"},"PeriodicalIF":5.6,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GEREM: Fast and Precise Error Resilience Assessment for GPU Microarchitectures","authors":"Jingweijia Tan;Xurui Li;An Zhong;Kaige Yan;Xiaohui Wei;Guanpeng Li","doi":"10.1109/TPDS.2025.3552679","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3552679","url":null,"abstract":"GPUs are widely used hardware acceleration platforms in many areas due to their great computational throughput. In the meanwhile, GPUs are vulnerable to transient hardware faults in the post-Moore era. Analyzing the error resilience of GPUs are critical for both hardware and software. Statistical fault injection approaches are commonly used for error resilience analysis, which are highly accurate but very time consuming. In this work, we propose GEREM, a first framework to speed up fault injection process so as to estimate the error resilience of GPU microarchitectures swiftly and precisely. We find early fault behaviors can be used to accurately predict the final outcomes of program execution. Based on this observation, we categorize the early behaviors of hardware faults into GPU Early Fault Manifestation models (EFMs). For data structures, EFMs are early propagation characteristics of faults, while for pipeline instructions, EFMs are heuristic properties of several instruction contexts. We further observe that EFMs are determined by static microarchitecture states, so we can capture them without actually simulating the program execution process under fault injections. Leveraging these observations, our GEREM framework first profiles the microarchitectural states related for EFMs at one time. It then injects faults into the profiled traces to immediately generate EFMs. For data storage structures, EFMs are directly used to predict final fault outcomes, while for pipeline instructions, machine learning is used for prediction. Evaluation results show GEREM precisely assesses the error resilience of GPU microarchitecture structures with <inline-formula><tex-math>$237times$</tex-math></inline-formula> speedup on average comparing with traditional fault injections.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1011-1024"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OmniLearn: A Framework for Distributed Deep Learning Over Heterogeneous Clusters","authors":"Sahil Tyagi;Prateek Sharma","doi":"10.1109/TPDS.2025.3553066","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553066","url":null,"abstract":"Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called <monospace>OmniLearn</monospace> to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, <monospace>OmniLearn</monospace> reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1253-1267"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Highly-Parallel and Scalable Hardware Accelerator for the NTest Othello Game Engine","authors":"Stefan Popa;Vlad Petric;Mihai Ivanovici","doi":"10.1109/TPDS.2025.3570596","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3570596","url":null,"abstract":"Othello is a two-player combinatorial game with 1E+28 legal positions and 1E+58 game tree complexity. We propose a HIghly PArallel, Scalable and configurable hardware accelerator for evaluating the middle and endgame Othello positions. We base HIPAS on NTest - a leading software Othello engine that uses the minimax algorithm with a quality pattern-based evaluation function, alpha-beta pruning, and heuristic mobility sorting. We describe its architecture and Field Programmable Gate Array implementation, measure its performance, and compare it with prior solutions. HIPAS achieves the highest quality evaluation, the highest performance with speed-ups up to several hundreds, and the best energy efficiency. The main novelty is the algorithm implementation as a circular pipeline and a Finite State Machine with pseudo-parallel processing. Although Othello was recently claimed to be weakly solved, the game remains unsolved in a stronger sense. A weak solution only shows how to force a draw. It does not guarantee a win if the opponent makes a mistake. HIPAS can validate the weak solution faster and more efficiently. A multi-threaded NTest software component evaluating the beginning and part of the middle game, combined with one or more instances of HIPAS for handling the remainder can provide a stronger solution.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1620-1633"},"PeriodicalIF":5.6,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144331566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Design of a High-Performance Fine-Grained Deduplication Framework for Backup Storage","authors":"Xiangyu Zou;Wen Xia;Philip Shilane;Haijun Zhang;Xuan Wang","doi":"10.1109/TPDS.2025.3551306","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3551306","url":null,"abstract":"Fine-grained deduplication (also known as delta compression) can achieve a better deduplication ratio compared to chunk-level deduplication. This technique removes not only identical chunks but also reduces redundancies between similar but non-identical chunks. Nevertheless, it introduces considerable I/O overhead in deduplication and restore processes, hindering the performance of these two processes and rendering fine-grained deduplication less popular than chunk-level deduplication to date. In this paper, we explore various issues that lead to additional I/O overhead and tackle them using several techniques. Moreover, we introduce MeGA, which attains fine-grained deduplication/restore speed nearly equivalent to chunk-level deduplication while maintaining the significant deduplication ratio benefit of fine-grained deduplication. Specifically, MeGA employs (1) a backup-workflow-oriented delta selector and cache-centric resemblance detection to mitigate poor spatial/temporal locality in the deduplication process, and (2) a delta-friendly data layout and “Always-Forward-Reference” traversal to address poor spatial/temporal locality in the restore workflow. Evaluations on four datasets show that MeGA achieves a better performance than other fine-grained deduplication approaches. Specifically, MeGA significantly outperforms the traditional greedy approach, providing 10–46 times better backup speed and 30–105 times more efficient restore speed, all while preserving a high deduplication ratio.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"945-960"},"PeriodicalIF":5.6,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Scalable Neural Network Quantum States Method for Molecular Potential Energy Surfaces","authors":"Yangjun Wu;Wanlu Cao;Jiacheng Zhao;Honghui Shang","doi":"10.1109/TPDS.2025.3568360","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3568360","url":null,"abstract":"The Neural Network Quantum States (NNQS) method is highly promising for accurately solving the Schrödinger equation, yet it encounters challenges such as computational demands and slow rates of convergence. To address the high computational requirements, we introduce optimizations including a cross-sample KV cache sharing technique to enhance sampling efficiency, Quantum Bitwise and BloomHash methods for more efficient local energy computation, and mixed-precision training strategies to boost computational efficiency. To overcome the issue of slow convergence, we propose a parallel training algorithm for NNQS under second quantization to accelerate the training of base models for molecular potential surfaces. Our approach achieves up to 27-fold acceleration specifically in local energy calculations in systems with 154 spin orbitals and demonstrates strong and weak scaling efficiencies of 98% and 97%, respectively, on the H<inline-formula><tex-math>$_{2}$</tex-math></inline-formula>O<inline-formula><tex-math>$_{2}$</tex-math></inline-formula> potential surface training set. The parallelized implementation of transformer-based NNQS is highly portable on various high-performance computing architectures, offering new perspectives on quantum chemistry simulations.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1431-1443"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcement Learning-Driven Adaptive Prefetch Aggressiveness Control for Enhanced Performance in Parallel System Architectures","authors":"Huijing Yang;Juan Fang;Yumin Hou;Xing Su;Neal N. Xiong","doi":"10.1109/TPDS.2025.3550531","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550531","url":null,"abstract":"In modern parallel system architectures, prefetchers are essential to mitigating the performance challenges posed by long memory access latencies. These architectures rely heavily on efficient memory access patterns to maximize system throughput and resource utilization. Prefetch aggressiveness is a central parameter in managing these access patterns; although increased prefetch aggressiveness can enhance performance for certain applications, it often risks causing cache pollution and bandwidth contention, leading to significant performance degradation in other workloads. While many existing prefetchers rely on static or simple built-in aggressiveness controllers, a more flexible, adaptive approach based on system-level feedback is essential to achieving optimal performance across parallel computing environments. In this paper, we introduce an Adaptive Prefetch Aggressiveness Control (APAC) framework that leverages Reinforcement Learning (RL) to dynamically manage prefetch aggressiveness in parallel system architectures. The APAC controller operates as an RL agent, which optimizes prefetch aggressiveness by dynamically responding to system feedback on prefetch accuracy, timeliness, and cache pollution. The agent receives a reward signal that reflects the impact of each adjustment on both performance and memory bandwidth, learning to adapt its control strategy based on workload characteristics. This data-driven adaptability makes APAC particularly well-suited for parallel architectures, where efficient resource management across cores is essential to scaling system performance. Our evaluation with the ChampSim simulator demonstrates that APAC effectively adapts to diverse workloads and system configurations, achieving performance gains of 6.73<inline-formula><tex-math>$%$</tex-math></inline-formula> in multi-core systems compared to traditional Feedback Directed Prefetching (FDP). By improving memory bandwidth utilization, reducing cache pollution, and minimizing inter-core interference, APAC significantly enhances prefetching performance in multi-core processors. These results underscore APAC’s potential as a robust solution for performance optimization in parallel system architectures, where efficient resource management is paramount for scaling modern processing environments.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"977-993"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10923695","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Acceleration Framework for Deep Reinforcement Learning Using Heterogeneous Systems","authors":"Yuan Meng;Mahesh A. Iyer;Viktor K. Prasanna","doi":"10.1109/TPDS.2025.3566766","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3566766","url":null,"abstract":"Deep Reinforcement Learning (DRL) is vital in various AI applications. DRL algorithms comprise diverse compute primitives, which may not be simultaneously optimized using a homogeneous architecture. However, even with available heterogeneous architectures, optimizing DRL performance remains a challenge due to the complexity of design space in parallelizing DRL primitives and the variety of hardware employed in modern data centers. To address this, we introduce a framework for composing parallel DRL systems on heterogeneous platforms consisting of general-purpose processors (CPUs) and accelerators (GPUs, FPGAs). Our innovations include: 1. A general training protocol agnostic of the underlying hardware, enabling portable implementations across various processors and accelerators. 2. Efficient design exploration and automatic task placement enabling parallelization of tasks within each DRL primitive over one or multiple heterogeneous devices. 3. Incorporation of DRL-specific optimizations on runtime scheduling and resource allocation, facilitating parallelized training and enhancing the overall system performance. 4. High-level API for productive development using the framework. We showcase our framework through experimentation with three widely used DRL algorithms, DQN, DDPG, and SAC, on three heterogeneous platforms with diverse hardware characteristics and interconnections. The generated implementations outperform state-of-the-art libraries for CPU-GPU platforms by throughput improvements of up to 2×, and <inline-formula><tex-math>$1.7times$</tex-math></inline-formula> higher performance portability across platforms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1401-1415"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144125636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CiMBA: Accelerating Genome Sequencing Through On-Device Basecalling via Compute-in-Memory","authors":"William Andrew Simon;Irem Boybat;Riselda Kodra;Elena Ferro;Gagandeep Singh;Mohammed Alser;Shubham Jain;Hsinyu Tsai;Geoffrey W. Burr;Onur Mutlu;Abu Sebastian","doi":"10.1109/TPDS.2025.3550811","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550811","url":null,"abstract":"As genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, consuming >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded (<inline-formula><tex-math>$sim 25$</tex-math></inline-formula> mm<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24× that required for real-time operation, and achieves 17 × /27× power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1130-1145"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AlignMalloc: Warp-Aware Memory Rearrangement Aligned With UVM Prefetching for Large-Scale GPU Dynamic Allocations","authors":"Jiajian Zhang;Fangyu Wu;Hai Jiang;Qiufeng Wang;Genlang Chen;Guangliang Cheng;Eng Gee Lim;Keqin Li","doi":"10.1109/TPDS.2025.3568688","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3568688","url":null,"abstract":"As parallel computing tasks rapidly expand in both complexity and scale, the need for efficient GPU dynamic memory allocation becomes increasingly important. While progress has been made in developing dynamic allocators for substantial applications, their real-world applicability is still limited due to inefficient memory access behaviors. This paper introduces AlignMalloc, a novel memory management system that aligns with the Unified Virtual Memory (UVM) prefetching strategy, significantly enhancing both memory allocation and access performance in large-scale dynamic allocation scenarios. We analyze the fundamental inefficiencies in UVM access and first reveal the mismatch between memory access and UVM prefetching methods. To resolve this issue, AlignMalloc implements a warp-aware memory rearrangement strategy that exploits the regularity of warps to align with the UVM’s static prefetching setup. Additionally, AlignMalloc introduces an OR tree-based structure within a host-co-managed framework to further optimize dynamic allocation. Comprehensive experiments demonstrate that AlignMalloc substantially outperforms current state-of-the-art systems, achieving up to <inline-formula><tex-math>$2.7 times$</tex-math></inline-formula> improvement in dynamic allocation and <inline-formula><tex-math>$2.3 times$</tex-math></inline-formula> in memory access. Additionally, eight real-world applications with diverse memory access patterns exhibit consistent performance enhancements, with average speedups <inline-formula><tex-math>$1.5 times$</tex-math></inline-formula>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1444-1459"},"PeriodicalIF":5.6,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}