{"title":"OmniLearn: A Framework for Distributed Deep Learning Over Heterogeneous Clusters","authors":"Sahil Tyagi;Prateek Sharma","doi":"10.1109/TPDS.2025.3553066","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553066","url":null,"abstract":"Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called <monospace>OmniLearn</monospace> to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, <monospace>OmniLearn</monospace> reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1253-1267"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Highly-Parallel and Scalable Hardware Accelerator for the NTest Othello Game Engine","authors":"Stefan Popa;Vlad Petric;Mihai Ivanovici","doi":"10.1109/TPDS.2025.3570596","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3570596","url":null,"abstract":"Othello is a two-player combinatorial game with 1E+28 legal positions and 1E+58 game tree complexity. We propose a HIghly PArallel, Scalable and configurable hardware accelerator for evaluating the middle and endgame Othello positions. We base HIPAS on NTest - a leading software Othello engine that uses the minimax algorithm with a quality pattern-based evaluation function, alpha-beta pruning, and heuristic mobility sorting. We describe its architecture and Field Programmable Gate Array implementation, measure its performance, and compare it with prior solutions. HIPAS achieves the highest quality evaluation, the highest performance with speed-ups up to several hundreds, and the best energy efficiency. The main novelty is the algorithm implementation as a circular pipeline and a Finite State Machine with pseudo-parallel processing. Although Othello was recently claimed to be weakly solved, the game remains unsolved in a stronger sense. A weak solution only shows how to force a draw. It does not guarantee a win if the opponent makes a mistake. HIPAS can validate the weak solution faster and more efficiently. A multi-threaded NTest software component evaluating the beginning and part of the middle game, combined with one or more instances of HIPAS for handling the remainder can provide a stronger solution.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1620-1633"},"PeriodicalIF":5.6,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144331566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Design of a High-Performance Fine-Grained Deduplication Framework for Backup Storage","authors":"Xiangyu Zou;Wen Xia;Philip Shilane;Haijun Zhang;Xuan Wang","doi":"10.1109/TPDS.2025.3551306","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3551306","url":null,"abstract":"Fine-grained deduplication (also known as delta compression) can achieve a better deduplication ratio compared to chunk-level deduplication. This technique removes not only identical chunks but also reduces redundancies between similar but non-identical chunks. Nevertheless, it introduces considerable I/O overhead in deduplication and restore processes, hindering the performance of these two processes and rendering fine-grained deduplication less popular than chunk-level deduplication to date. In this paper, we explore various issues that lead to additional I/O overhead and tackle them using several techniques. Moreover, we introduce MeGA, which attains fine-grained deduplication/restore speed nearly equivalent to chunk-level deduplication while maintaining the significant deduplication ratio benefit of fine-grained deduplication. Specifically, MeGA employs (1) a backup-workflow-oriented delta selector and cache-centric resemblance detection to mitigate poor spatial/temporal locality in the deduplication process, and (2) a delta-friendly data layout and “Always-Forward-Reference” traversal to address poor spatial/temporal locality in the restore workflow. Evaluations on four datasets show that MeGA achieves a better performance than other fine-grained deduplication approaches. Specifically, MeGA significantly outperforms the traditional greedy approach, providing 10–46 times better backup speed and 30–105 times more efficient restore speed, all while preserving a high deduplication ratio.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"945-960"},"PeriodicalIF":5.6,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Scalable Neural Network Quantum States Method for Molecular Potential Energy Surfaces","authors":"Yangjun Wu;Wanlu Cao;Jiacheng Zhao;Honghui Shang","doi":"10.1109/TPDS.2025.3568360","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3568360","url":null,"abstract":"The Neural Network Quantum States (NNQS) method is highly promising for accurately solving the Schrödinger equation, yet it encounters challenges such as computational demands and slow rates of convergence. To address the high computational requirements, we introduce optimizations including a cross-sample KV cache sharing technique to enhance sampling efficiency, Quantum Bitwise and BloomHash methods for more efficient local energy computation, and mixed-precision training strategies to boost computational efficiency. To overcome the issue of slow convergence, we propose a parallel training algorithm for NNQS under second quantization to accelerate the training of base models for molecular potential surfaces. Our approach achieves up to 27-fold acceleration specifically in local energy calculations in systems with 154 spin orbitals and demonstrates strong and weak scaling efficiencies of 98% and 97%, respectively, on the H<inline-formula><tex-math>$_{2}$</tex-math></inline-formula>O<inline-formula><tex-math>$_{2}$</tex-math></inline-formula> potential surface training set. The parallelized implementation of transformer-based NNQS is highly portable on various high-performance computing architectures, offering new perspectives on quantum chemistry simulations.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1431-1443"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcement Learning-Driven Adaptive Prefetch Aggressiveness Control for Enhanced Performance in Parallel System Architectures","authors":"Huijing Yang;Juan Fang;Yumin Hou;Xing Su;Neal N. Xiong","doi":"10.1109/TPDS.2025.3550531","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550531","url":null,"abstract":"In modern parallel system architectures, prefetchers are essential to mitigating the performance challenges posed by long memory access latencies. These architectures rely heavily on efficient memory access patterns to maximize system throughput and resource utilization. Prefetch aggressiveness is a central parameter in managing these access patterns; although increased prefetch aggressiveness can enhance performance for certain applications, it often risks causing cache pollution and bandwidth contention, leading to significant performance degradation in other workloads. While many existing prefetchers rely on static or simple built-in aggressiveness controllers, a more flexible, adaptive approach based on system-level feedback is essential to achieving optimal performance across parallel computing environments. In this paper, we introduce an Adaptive Prefetch Aggressiveness Control (APAC) framework that leverages Reinforcement Learning (RL) to dynamically manage prefetch aggressiveness in parallel system architectures. The APAC controller operates as an RL agent, which optimizes prefetch aggressiveness by dynamically responding to system feedback on prefetch accuracy, timeliness, and cache pollution. The agent receives a reward signal that reflects the impact of each adjustment on both performance and memory bandwidth, learning to adapt its control strategy based on workload characteristics. This data-driven adaptability makes APAC particularly well-suited for parallel architectures, where efficient resource management across cores is essential to scaling system performance. Our evaluation with the ChampSim simulator demonstrates that APAC effectively adapts to diverse workloads and system configurations, achieving performance gains of 6.73<inline-formula><tex-math>$%$</tex-math></inline-formula> in multi-core systems compared to traditional Feedback Directed Prefetching (FDP). By improving memory bandwidth utilization, reducing cache pollution, and minimizing inter-core interference, APAC significantly enhances prefetching performance in multi-core processors. These results underscore APAC’s potential as a robust solution for performance optimization in parallel system architectures, where efficient resource management is paramount for scaling modern processing environments.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"977-993"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10923695","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Acceleration Framework for Deep Reinforcement Learning Using Heterogeneous Systems","authors":"Yuan Meng;Mahesh A. Iyer;Viktor K. Prasanna","doi":"10.1109/TPDS.2025.3566766","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3566766","url":null,"abstract":"Deep Reinforcement Learning (DRL) is vital in various AI applications. DRL algorithms comprise diverse compute primitives, which may not be simultaneously optimized using a homogeneous architecture. However, even with available heterogeneous architectures, optimizing DRL performance remains a challenge due to the complexity of design space in parallelizing DRL primitives and the variety of hardware employed in modern data centers. To address this, we introduce a framework for composing parallel DRL systems on heterogeneous platforms consisting of general-purpose processors (CPUs) and accelerators (GPUs, FPGAs). Our innovations include: 1. A general training protocol agnostic of the underlying hardware, enabling portable implementations across various processors and accelerators. 2. Efficient design exploration and automatic task placement enabling parallelization of tasks within each DRL primitive over one or multiple heterogeneous devices. 3. Incorporation of DRL-specific optimizations on runtime scheduling and resource allocation, facilitating parallelized training and enhancing the overall system performance. 4. High-level API for productive development using the framework. We showcase our framework through experimentation with three widely used DRL algorithms, DQN, DDPG, and SAC, on three heterogeneous platforms with diverse hardware characteristics and interconnections. The generated implementations outperform state-of-the-art libraries for CPU-GPU platforms by throughput improvements of up to 2×, and <inline-formula><tex-math>$1.7times$</tex-math></inline-formula> higher performance portability across platforms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1401-1415"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144125636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CiMBA: Accelerating Genome Sequencing Through On-Device Basecalling via Compute-in-Memory","authors":"William Andrew Simon;Irem Boybat;Riselda Kodra;Elena Ferro;Gagandeep Singh;Mohammed Alser;Shubham Jain;Hsinyu Tsai;Geoffrey W. Burr;Onur Mutlu;Abu Sebastian","doi":"10.1109/TPDS.2025.3550811","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550811","url":null,"abstract":"As genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, consuming >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded (<inline-formula><tex-math>$sim 25$</tex-math></inline-formula> mm<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24× that required for real-time operation, and achieves 17 × /27× power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1130-1145"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AlignMalloc: Warp-Aware Memory Rearrangement Aligned With UVM Prefetching for Large-Scale GPU Dynamic Allocations","authors":"Jiajian Zhang;Fangyu Wu;Hai Jiang;Qiufeng Wang;Genlang Chen;Guangliang Cheng;Eng Gee Lim;Keqin Li","doi":"10.1109/TPDS.2025.3568688","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3568688","url":null,"abstract":"As parallel computing tasks rapidly expand in both complexity and scale, the need for efficient GPU dynamic memory allocation becomes increasingly important. While progress has been made in developing dynamic allocators for substantial applications, their real-world applicability is still limited due to inefficient memory access behaviors. This paper introduces AlignMalloc, a novel memory management system that aligns with the Unified Virtual Memory (UVM) prefetching strategy, significantly enhancing both memory allocation and access performance in large-scale dynamic allocation scenarios. We analyze the fundamental inefficiencies in UVM access and first reveal the mismatch between memory access and UVM prefetching methods. To resolve this issue, AlignMalloc implements a warp-aware memory rearrangement strategy that exploits the regularity of warps to align with the UVM’s static prefetching setup. Additionally, AlignMalloc introduces an OR tree-based structure within a host-co-managed framework to further optimize dynamic allocation. Comprehensive experiments demonstrate that AlignMalloc substantially outperforms current state-of-the-art systems, achieving up to <inline-formula><tex-math>$2.7 times$</tex-math></inline-formula> improvement in dynamic allocation and <inline-formula><tex-math>$2.3 times$</tex-math></inline-formula> in memory access. Additionally, eight real-world applications with diverse memory access patterns exhibit consistent performance enhancements, with average speedups <inline-formula><tex-math>$1.5 times$</tex-math></inline-formula>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1444-1459"},"PeriodicalIF":5.6,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graphite: Hardware-Aware GNN Reshaping for Acceleration With GPU Tensor Cores","authors":"Hyeonjin Kim;Taesoo Lim;William J. Song","doi":"10.1109/TPDS.2025.3549180","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3549180","url":null,"abstract":"Graph neural networks (GNNs) have emerged as powerful tools for addressing non-euclidean problems. GNNs operate through two key execution phases: i) aggregation and ii) combination. In the aggregation phase, the feature data of neighboring graph nodes are gathered, which is expressed as sparse-dense matrix multiplication (SpMM) between an adjacency matrix and a feature embedding table. The combination phase takes the aggregated feature embedding as input to a neural network model with learnable weights. Typically, the adjacency matrix is extremely sparse due to inherent graph structures, making the aggregation phase a significant bottleneck in GNN computations. This paper introduces <italic>Graphite</i>, a GNN acceleration framework to overcome the challenge of SpMM operations and enable graphics processing units (GPUs) to exploit massive thread-level parallelism more efficiently via existing dense acceleration units (i.e., tensor cores). To that end, Graphite employs three techniques for GNN acceleration. First, <italic>hardware-aware sparse graph reshaping (HAS)</i> rearranges graph structures to replace sparse operations with dense computations, enabling hardware acceleration through GPU tensor cores. Additionally, <italic>balanced thread block scheduling (BTS)</i> distributes sparse thread blocks evenly across streaming multiprocessors in GPUs, and <italic>zero-aware warp skipping (ZAWS)</i> eliminates ineffective threads that operate on meaningless zeros. Experimental results show that Graphite achieves an average compression rate of 84.1% for adjacency matrices using HAS. Combined with BTS and ZAWS, Graphite delivers an average 1.55x speedup over the conventional SpMM-based GNN computation method.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"918-931"},"PeriodicalIF":5.6,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedLoRE: Communication-Efficient and Personalized Edge Intelligence Framework via Federated Low-Rank Estimation","authors":"Zerui Shao;Beibei Li;Peiran Wang;Yi Zhang;Kim-Kwang Raymond Choo","doi":"10.1109/TPDS.2025.3548444","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3548444","url":null,"abstract":"Federated learning (FL) has recently garnered significant attention in edge intelligence. However, FL faces two major challenges: First, statistical heterogeneity can adversely impact the performance of the global model on each client. Second, the model transmission between server and clients leads to substantial communication overhead. Previous works often suffer from the trade-off issue between these seemingly competing goals, yet we show that it is possible to address both challenges simultaneously. We propose a novel communication-efficient personalized FL framework for edge intelligence that estimates the low-rank component of the training model gradient and stores the residual component at each client. The low-rank components obtained across communication rounds have high similarity, and sharing these components with the server can significantly reduce communication overhead. Specifically, we highlight the importance of previously neglected residual components in tackling statistical heterogeneity, and retaining them locally for training model updates can effectively improve the personalization performance. Moreover, we provide a theoretical analysis of the convergence guarantee of our framework. Extensive experimental results demonstrate that our framework outperforms state-of-the-art approaches, achieving up to 89.18% reduction in communication overhead and 91.00% reduction in computation overhead while maintaining comparable personalization accuracy compared to previous works.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"994-1010"},"PeriodicalIF":5.6,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}