{"title":"OmniLearn: A Framework for Distributed Deep Learning Over Heterogeneous Clusters","authors":"Sahil Tyagi;Prateek Sharma","doi":"10.1109/TPDS.2025.3553066","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553066","url":null,"abstract":"Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called <monospace>OmniLearn</monospace> to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, <monospace>OmniLearn</monospace> reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1253-1267"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Highly-Parallel and Scalable Hardware Accelerator for the NTest Othello Game Engine","authors":"Stefan Popa;Vlad Petric;Mihai Ivanovici","doi":"10.1109/TPDS.2025.3570596","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3570596","url":null,"abstract":"Othello is a two-player combinatorial game with 1E+28 legal positions and 1E+58 game tree complexity. We propose a HIghly PArallel, Scalable and configurable hardware accelerator for evaluating the middle and endgame Othello positions. We base HIPAS on NTest - a leading software Othello engine that uses the minimax algorithm with a quality pattern-based evaluation function, alpha-beta pruning, and heuristic mobility sorting. We describe its architecture and Field Programmable Gate Array implementation, measure its performance, and compare it with prior solutions. HIPAS achieves the highest quality evaluation, the highest performance with speed-ups up to several hundreds, and the best energy efficiency. The main novelty is the algorithm implementation as a circular pipeline and a Finite State Machine with pseudo-parallel processing. Although Othello was recently claimed to be weakly solved, the game remains unsolved in a stronger sense. A weak solution only shows how to force a draw. It does not guarantee a win if the opponent makes a mistake. HIPAS can validate the weak solution faster and more efficiently. A multi-threaded NTest software component evaluating the beginning and part of the middle game, combined with one or more instances of HIPAS for handling the remainder can provide a stronger solution.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1620-1633"},"PeriodicalIF":5.6,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144331566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Design of a High-Performance Fine-Grained Deduplication Framework for Backup Storage","authors":"Xiangyu Zou;Wen Xia;Philip Shilane;Haijun Zhang;Xuan Wang","doi":"10.1109/TPDS.2025.3551306","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3551306","url":null,"abstract":"Fine-grained deduplication (also known as delta compression) can achieve a better deduplication ratio compared to chunk-level deduplication. This technique removes not only identical chunks but also reduces redundancies between similar but non-identical chunks. Nevertheless, it introduces considerable I/O overhead in deduplication and restore processes, hindering the performance of these two processes and rendering fine-grained deduplication less popular than chunk-level deduplication to date. In this paper, we explore various issues that lead to additional I/O overhead and tackle them using several techniques. Moreover, we introduce MeGA, which attains fine-grained deduplication/restore speed nearly equivalent to chunk-level deduplication while maintaining the significant deduplication ratio benefit of fine-grained deduplication. Specifically, MeGA employs (1) a backup-workflow-oriented delta selector and cache-centric resemblance detection to mitigate poor spatial/temporal locality in the deduplication process, and (2) a delta-friendly data layout and “Always-Forward-Reference” traversal to address poor spatial/temporal locality in the restore workflow. Evaluations on four datasets show that MeGA achieves a better performance than other fine-grained deduplication approaches. Specifically, MeGA significantly outperforms the traditional greedy approach, providing 10–46 times better backup speed and 30–105 times more efficient restore speed, all while preserving a high deduplication ratio.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"945-960"},"PeriodicalIF":5.6,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Scalable Neural Network Quantum States Method for Molecular Potential Energy Surfaces","authors":"Yangjun Wu;Wanlu Cao;Jiacheng Zhao;Honghui Shang","doi":"10.1109/TPDS.2025.3568360","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3568360","url":null,"abstract":"The Neural Network Quantum States (NNQS) method is highly promising for accurately solving the Schrödinger equation, yet it encounters challenges such as computational demands and slow rates of convergence. To address the high computational requirements, we introduce optimizations including a cross-sample KV cache sharing technique to enhance sampling efficiency, Quantum Bitwise and BloomHash methods for more efficient local energy computation, and mixed-precision training strategies to boost computational efficiency. To overcome the issue of slow convergence, we propose a parallel training algorithm for NNQS under second quantization to accelerate the training of base models for molecular potential surfaces. Our approach achieves up to 27-fold acceleration specifically in local energy calculations in systems with 154 spin orbitals and demonstrates strong and weak scaling efficiencies of 98% and 97%, respectively, on the H<inline-formula><tex-math>$_{2}$</tex-math></inline-formula>O<inline-formula><tex-math>$_{2}$</tex-math></inline-formula> potential surface training set. The parallelized implementation of transformer-based NNQS is highly portable on various high-performance computing architectures, offering new perspectives on quantum chemistry simulations.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1431-1443"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcement Learning-Driven Adaptive Prefetch Aggressiveness Control for Enhanced Performance in Parallel System Architectures","authors":"Huijing Yang;Juan Fang;Yumin Hou;Xing Su;Neal N. Xiong","doi":"10.1109/TPDS.2025.3550531","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550531","url":null,"abstract":"In modern parallel system architectures, prefetchers are essential to mitigating the performance challenges posed by long memory access latencies. These architectures rely heavily on efficient memory access patterns to maximize system throughput and resource utilization. Prefetch aggressiveness is a central parameter in managing these access patterns; although increased prefetch aggressiveness can enhance performance for certain applications, it often risks causing cache pollution and bandwidth contention, leading to significant performance degradation in other workloads. While many existing prefetchers rely on static or simple built-in aggressiveness controllers, a more flexible, adaptive approach based on system-level feedback is essential to achieving optimal performance across parallel computing environments. In this paper, we introduce an Adaptive Prefetch Aggressiveness Control (APAC) framework that leverages Reinforcement Learning (RL) to dynamically manage prefetch aggressiveness in parallel system architectures. The APAC controller operates as an RL agent, which optimizes prefetch aggressiveness by dynamically responding to system feedback on prefetch accuracy, timeliness, and cache pollution. The agent receives a reward signal that reflects the impact of each adjustment on both performance and memory bandwidth, learning to adapt its control strategy based on workload characteristics. This data-driven adaptability makes APAC particularly well-suited for parallel architectures, where efficient resource management across cores is essential to scaling system performance. Our evaluation with the ChampSim simulator demonstrates that APAC effectively adapts to diverse workloads and system configurations, achieving performance gains of 6.73<inline-formula><tex-math>$%$</tex-math></inline-formula> in multi-core systems compared to traditional Feedback Directed Prefetching (FDP). By improving memory bandwidth utilization, reducing cache pollution, and minimizing inter-core interference, APAC significantly enhances prefetching performance in multi-core processors. These results underscore APAC’s potential as a robust solution for performance optimization in parallel system architectures, where efficient resource management is paramount for scaling modern processing environments.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"977-993"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10923695","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Acceleration Framework for Deep Reinforcement Learning Using Heterogeneous Systems","authors":"Yuan Meng;Mahesh A. Iyer;Viktor K. Prasanna","doi":"10.1109/TPDS.2025.3566766","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3566766","url":null,"abstract":"Deep Reinforcement Learning (DRL) is vital in various AI applications. DRL algorithms comprise diverse compute primitives, which may not be simultaneously optimized using a homogeneous architecture. However, even with available heterogeneous architectures, optimizing DRL performance remains a challenge due to the complexity of design space in parallelizing DRL primitives and the variety of hardware employed in modern data centers. To address this, we introduce a framework for composing parallel DRL systems on heterogeneous platforms consisting of general-purpose processors (CPUs) and accelerators (GPUs, FPGAs). Our innovations include: 1. A general training protocol agnostic of the underlying hardware, enabling portable implementations across various processors and accelerators. 2. Efficient design exploration and automatic task placement enabling parallelization of tasks within each DRL primitive over one or multiple heterogeneous devices. 3. Incorporation of DRL-specific optimizations on runtime scheduling and resource allocation, facilitating parallelized training and enhancing the overall system performance. 4. High-level API for productive development using the framework. We showcase our framework through experimentation with three widely used DRL algorithms, DQN, DDPG, and SAC, on three heterogeneous platforms with diverse hardware characteristics and interconnections. The generated implementations outperform state-of-the-art libraries for CPU-GPU platforms by throughput improvements of up to 2×, and <inline-formula><tex-math>$1.7times$</tex-math></inline-formula> higher performance portability across platforms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1401-1415"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144125636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CiMBA: Accelerating Genome Sequencing Through On-Device Basecalling via Compute-in-Memory","authors":"William Andrew Simon;Irem Boybat;Riselda Kodra;Elena Ferro;Gagandeep Singh;Mohammed Alser;Shubham Jain;Hsinyu Tsai;Geoffrey W. Burr;Onur Mutlu;Abu Sebastian","doi":"10.1109/TPDS.2025.3550811","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550811","url":null,"abstract":"As genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, consuming >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded (<inline-formula><tex-math>$sim 25$</tex-math></inline-formula> mm<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24× that required for real-time operation, and achieves 17 × /27× power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1130-1145"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AlignMalloc: Warp-Aware Memory Rearrangement Aligned With UVM Prefetching for Large-Scale GPU Dynamic Allocations","authors":"Jiajian Zhang;Fangyu Wu;Hai Jiang;Qiufeng Wang;Genlang Chen;Guangliang Cheng;Eng Gee Lim;Keqin Li","doi":"10.1109/TPDS.2025.3568688","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3568688","url":null,"abstract":"As parallel computing tasks rapidly expand in both complexity and scale, the need for efficient GPU dynamic memory allocation becomes increasingly important. While progress has been made in developing dynamic allocators for substantial applications, their real-world applicability is still limited due to inefficient memory access behaviors. This paper introduces AlignMalloc, a novel memory management system that aligns with the Unified Virtual Memory (UVM) prefetching strategy, significantly enhancing both memory allocation and access performance in large-scale dynamic allocation scenarios. We analyze the fundamental inefficiencies in UVM access and first reveal the mismatch between memory access and UVM prefetching methods. To resolve this issue, AlignMalloc implements a warp-aware memory rearrangement strategy that exploits the regularity of warps to align with the UVM’s static prefetching setup. Additionally, AlignMalloc introduces an OR tree-based structure within a host-co-managed framework to further optimize dynamic allocation. Comprehensive experiments demonstrate that AlignMalloc substantially outperforms current state-of-the-art systems, achieving up to <inline-formula><tex-math>$2.7 times$</tex-math></inline-formula> improvement in dynamic allocation and <inline-formula><tex-math>$2.3 times$</tex-math></inline-formula> in memory access. Additionally, eight real-world applications with diverse memory access patterns exhibit consistent performance enhancements, with average speedups <inline-formula><tex-math>$1.5 times$</tex-math></inline-formula>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1444-1459"},"PeriodicalIF":5.6,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graphite: Hardware-Aware GNN Reshaping for Acceleration With GPU Tensor Cores","authors":"Hyeonjin Kim;Taesoo Lim;William J. Song","doi":"10.1109/TPDS.2025.3549180","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3549180","url":null,"abstract":"Graph neural networks (GNNs) have emerged as powerful tools for addressing non-euclidean problems. GNNs operate through two key execution phases: i) aggregation and ii) combination. In the aggregation phase, the feature data of neighboring graph nodes are gathered, which is expressed as sparse-dense matrix multiplication (SpMM) between an adjacency matrix and a feature embedding table. The combination phase takes the aggregated feature embedding as input to a neural network model with learnable weights. Typically, the adjacency matrix is extremely sparse due to inherent graph structures, making the aggregation phase a significant bottleneck in GNN computations. This paper introduces <italic>Graphite</i>, a GNN acceleration framework to overcome the challenge of SpMM operations and enable graphics processing units (GPUs) to exploit massive thread-level parallelism more efficiently via existing dense acceleration units (i.e., tensor cores). To that end, Graphite employs three techniques for GNN acceleration. First, <italic>hardware-aware sparse graph reshaping (HAS)</i> rearranges graph structures to replace sparse operations with dense computations, enabling hardware acceleration through GPU tensor cores. Additionally, <italic>balanced thread block scheduling (BTS)</i> distributes sparse thread blocks evenly across streaming multiprocessors in GPUs, and <italic>zero-aware warp skipping (ZAWS)</i> eliminates ineffective threads that operate on meaningless zeros. Experimental results show that Graphite achieves an average compression rate of 84.1% for adjacency matrices using HAS. Combined with BTS and ZAWS, Graphite delivers an average 1.55x speedup over the conventional SpMM-based GNN computation method.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"918-931"},"PeriodicalIF":5.6,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedLoRE: Communication-Efficient and Personalized Edge Intelligence Framework via Federated Low-Rank Estimation","authors":"Zerui Shao;Beibei Li;Peiran Wang;Yi Zhang;Kim-Kwang Raymond Choo","doi":"10.1109/TPDS.2025.3548444","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3548444","url":null,"abstract":"Federated learning (FL) has recently garnered significant attention in edge intelligence. However, FL faces two major challenges: First, statistical heterogeneity can adversely impact the performance of the global model on each client. Second, the model transmission between server and clients leads to substantial communication overhead. Previous works often suffer from the trade-off issue between these seemingly competing goals, yet we show that it is possible to address both challenges simultaneously. We propose a novel communication-efficient personalized FL framework for edge intelligence that estimates the low-rank component of the training model gradient and stores the residual component at each client. The low-rank components obtained across communication rounds have high similarity, and sharing these components with the server can significantly reduce communication overhead. Specifically, we highlight the importance of previously neglected residual components in tackling statistical heterogeneity, and retaining them locally for training model updates can effectively improve the personalization performance. Moreover, we provide a theoretical analysis of the convergence guarantee of our framework. Extensive experimental results demonstrate that our framework outperforms state-of-the-art approaches, achieving up to 89.18% reduction in communication overhead and 91.00% reduction in computation overhead while maintaining comparable personalization accuracy compared to previous works.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"994-1010"},"PeriodicalIF":5.6,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}