{"title":"Taming Offload Overheads in a Massively Parallel Open-Source RISC-V MPSoC: Analysis and Optimization","authors":"Luca Colagrande;Luca Benini","doi":"10.1109/TPDS.2025.3555718","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3555718","url":null,"abstract":"Heterogeneous multi-core architectures combine on a single chip a few large, general-purpose <italic>host</i> cores, optimized for single-thread performance, with (many) clusters of small, specialized, energy-efficient <italic>accelerator</i> cores for data-parallel processing. Offloading a computation to the many-core acceleration fabric implies synchronization and communication overheads which can hamper overall performance and efficiency, particularly for small and fine-grained parallel tasks. In this work, we present a detailed, cycle-accurate quantitative analysis of the offload overheads on Occamy, an open-source massively parallel RISC-V based heterogeneous MPSoC. We study how the overheads scale with the number of accelerator cores. We explore an approach to drastically reduce these overheads by co-designing the hardware and the offload routines. Notably, we demonstrate that by incorporating multicast capabilities into the Network-on-Chip of a large (200+ cores) accelerator fabric we can improve offloaded application runtimes by as much as 2.3x, restoring more than 70% of the ideally attainable speedups. Finally, we propose a quantitative model to estimate the runtime of selected applications accounting for the offload overheads, with an error consistently below 15%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1193-1205"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation","authors":"Jan Laukemann;Ahmed E. Helal;S. Isaac Geronimo Anderson;Fabio Checconi;Yongseok Soh;Jesmin Jahan Tithi;Teresa Ranadive;Brian J. Gravelle;Fabrizio Petrini;Jee Choi","doi":"10.1109/TPDS.2025.3553092","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553092","url":null,"abstract":"High-dimensional sparse data emerge in many critical application domains such as healthcare and cybersecurity. To extract meaningful insights from massive volumes of these multi-dimensional data, scientists employ unsupervised analysis tools based on tensor decomposition (TD) methods. However, real-world sparse tensors exhibit highly irregular shapes and data distributions, which pose significant challenges for making efficient use of modern parallel processors. This study breaks the prevailing assumption that compressing sparse tensors into coarse-grained structures (i.e., tensor slices or blocks) or along a particular dimension/mode (i.e., mode-specific) is more efficient than keeping them in a fine-grained, mode-agnostic form. Our novel sparse tensor representation, Adaptive Linearized Tensor Order (<inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula>), encodes tensors in a compact format that can be easily streamed from memory and is amenable to both caching and parallel execution. In contrast to existing compressed tensor formats, <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula> constructs one tensor copy that is agnostic to both the mode orientation and the irregular distribution of nonzero elements. To demonstrate the efficacy of <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula>, we accelerate popular TD methods that compute the Canonical Polyadic Decomposition (CPD) model across different types of sparse tensors. We propose a set of parallel TD algorithms that exploit the inherent data reuse of tensor computations to substantially reduce synchronization overhead, decrease memory footprint, and improve parallel performance. Additionally, we characterize the major execution bottlenecks of TD methods on multiple generations of the latest Intel Xeon Scalable processors, including Sapphire Rapids CPUs, and introduce dynamic adaptation heuristics to automatically select the best algorithm based on the sparse tensor characteristics. Across a diverse set of real-world data sets, <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula> outperforms the state-of-the-art approaches, achieving more than an order-of-magnitude speedup over the best mode-agnostic formats. Compared to the best mode-specific formats, which require multiple tensor copies, <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula>achieves <inline-formula><tex-math>$5.1times$</tex-math></inline-formula> geometric mean speedup at a fraction (25% ) of their storage costs. 
Moreover, <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula> obtains <inline-formula><tex-math>$8.4times$</tex-math></inline-formula> geometric mean speedup over the state-of-the-art memoization approach, which reduces computations by using extra memory, while requiring 14% of its memory consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1025-1041"},"PeriodicalIF":5.6,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
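As an illustration of the mode-agnostic linearization idea in the ALTO entry above, the sketch below packs each nonzero's multi-dimensional coordinate into a single integer key via plain bit interleaving. ALTO's actual encoding adapts the bit layout to the tensor's dimensions, so `linearize`/`delinearize` here are stand-ins, not the paper's format.

```python
# Minimal sketch of mode-agnostic coordinate linearization, assuming a simple
# bit-interleaving scheme as a stand-in for ALTO's adaptive encoding.

def linearize(coords, bits_per_mode):
    """Pack an N-dimensional nonzero coordinate into one integer key by
    interleaving the bits of each mode index."""
    key, out_bit = 0, 0
    for b in range(max(bits_per_mode)):
        for c, nbits in zip(coords, bits_per_mode):
            if b < nbits:
                key |= ((c >> b) & 1) << out_bit
                out_bit += 1
    return key

def delinearize(key, bits_per_mode):
    """Recover the per-mode indices from a linearized key (inverse of linearize)."""
    coords, out_bit = [0] * len(bits_per_mode), 0
    for b in range(max(bits_per_mode)):
        for mode, nbits in enumerate(bits_per_mode):
            if b < nbits:
                coords[mode] |= ((key >> out_bit) & 1) << b
                out_bit += 1
    return tuple(coords)

# Nonzeros sorted by this single key can be streamed in one pass for any mode,
# which is the property that removes the need for per-mode tensor copies.
nnz = [(3, 1, 2), (0, 5, 7), (2, 2, 2)]
bits = (2, 3, 3)  # enough bits to cover each mode's extent
assert all(delinearize(linearize(c, bits), bits) == c for c in nnz)
print(sorted(linearize(c, bits) for c in nnz))
```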
{"title":"IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters","authors":"Wei Gao;Zhuoyuan Ouyang;Peng Sun;Tianwei Zhang;Yonggang Wen","doi":"10.1109/TPDS.2025.3553137","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553137","url":null,"abstract":"The high resource demand of deep learning training (DLT) workloads necessitates the design of efficient schedulers. While most existing schedulers expedite DLT workloads by considering GPU sharing and elastic training, they neglect <italic>layer elasticity</i>, which dynamically freezes certain layers of a network. This technique has been shown to significantly speed up individual workloads. In this paper, we explore how to incorporate <italic>layer elasticity</i> into DLT scheduler designs to achieve higher cluster-wide efficiency. A key factor that hinders the application of layer elasticity in GPU clusters is the potential loss in model accuracy, making users reluctant to enable layer elasticity for their workloads. It is necessary to have an efficient layer-elastic system, which can well balance training accuracy and speed for layer elasticity. We introduce <sc>IceFrog</small>, the first scheduling system that utilizes layer elasticity to improve the efficiency of DLT workloads in GPU clusters. It achieves this goal with superior algorithmic designs and intelligent resource management. In particular, (1) we model the frozen penalty and layer-aware throughput to measure the effective progress metric of layer-elastic workloads. (2) We design a novel scheduler to further improve the efficiency of layer elasticity. We implement and deploy <sc>IceFrog</small> in a physical cluster of 48 GPUs. Extensive evaluations and large-scale simulations show that <sc>IceFrog</small> reduces average job completion times by 36-48% relative to state-of-the-art DL schedulers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1071-1086"},"PeriodicalIF":5.6,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GEREM: Fast and Precise Error Resilience Assessment for GPU Microarchitectures","authors":"Jingweijia Tan;Xurui Li;An Zhong;Kaige Yan;Xiaohui Wei;Guanpeng Li","doi":"10.1109/TPDS.2025.3552679","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3552679","url":null,"abstract":"GPUs are widely used hardware acceleration platforms in many areas due to their great computational throughput. In the meanwhile, GPUs are vulnerable to transient hardware faults in the post-Moore era. Analyzing the error resilience of GPUs are critical for both hardware and software. Statistical fault injection approaches are commonly used for error resilience analysis, which are highly accurate but very time consuming. In this work, we propose GEREM, a first framework to speed up fault injection process so as to estimate the error resilience of GPU microarchitectures swiftly and precisely. We find early fault behaviors can be used to accurately predict the final outcomes of program execution. Based on this observation, we categorize the early behaviors of hardware faults into GPU Early Fault Manifestation models (EFMs). For data structures, EFMs are early propagation characteristics of faults, while for pipeline instructions, EFMs are heuristic properties of several instruction contexts. We further observe that EFMs are determined by static microarchitecture states, so we can capture them without actually simulating the program execution process under fault injections. Leveraging these observations, our GEREM framework first profiles the microarchitectural states related for EFMs at one time. It then injects faults into the profiled traces to immediately generate EFMs. For data storage structures, EFMs are directly used to predict final fault outcomes, while for pipeline instructions, machine learning is used for prediction. Evaluation results show GEREM precisely assesses the error resilience of GPU microarchitecture structures with <inline-formula><tex-math>$237times$</tex-math></inline-formula> speedup on average comparing with traditional fault injections.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1011-1024"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OmniLearn: A Framework for Distributed Deep Learning Over Heterogeneous Clusters","authors":"Sahil Tyagi;Prateek Sharma","doi":"10.1109/TPDS.2025.3553066","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553066","url":null,"abstract":"Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called <monospace>OmniLearn</monospace> to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, <monospace>OmniLearn</monospace> reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1253-1267"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Design of a High-Performance Fine-Grained Deduplication Framework for Backup Storage","authors":"Xiangyu Zou;Wen Xia;Philip Shilane;Haijun Zhang;Xuan Wang","doi":"10.1109/TPDS.2025.3551306","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3551306","url":null,"abstract":"Fine-grained deduplication (also known as delta compression) can achieve a better deduplication ratio compared to chunk-level deduplication. This technique removes not only identical chunks but also reduces redundancies between similar but non-identical chunks. Nevertheless, it introduces considerable I/O overhead in deduplication and restore processes, hindering the performance of these two processes and rendering fine-grained deduplication less popular than chunk-level deduplication to date. In this paper, we explore various issues that lead to additional I/O overhead and tackle them using several techniques. Moreover, we introduce MeGA, which attains fine-grained deduplication/restore speed nearly equivalent to chunk-level deduplication while maintaining the significant deduplication ratio benefit of fine-grained deduplication. Specifically, MeGA employs (1) a backup-workflow-oriented delta selector and cache-centric resemblance detection to mitigate poor spatial/temporal locality in the deduplication process, and (2) a delta-friendly data layout and “Always-Forward-Reference” traversal to address poor spatial/temporal locality in the restore workflow. Evaluations on four datasets show that MeGA achieves a better performance than other fine-grained deduplication approaches. Specifically, MeGA significantly outperforms the traditional greedy approach, providing 10–46 times better backup speed and 30–105 times more efficient restore speed, all while preserving a high deduplication ratio.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"945-960"},"PeriodicalIF":5.6,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcement Learning-Driven Adaptive Prefetch Aggressiveness Control for Enhanced Performance in Parallel System Architectures","authors":"Huijing Yang;Juan Fang;Yumin Hou;Xing Su;Neal N. Xiong","doi":"10.1109/TPDS.2025.3550531","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550531","url":null,"abstract":"In modern parallel system architectures, prefetchers are essential to mitigating the performance challenges posed by long memory access latencies. These architectures rely heavily on efficient memory access patterns to maximize system throughput and resource utilization. Prefetch aggressiveness is a central parameter in managing these access patterns; although increased prefetch aggressiveness can enhance performance for certain applications, it often risks causing cache pollution and bandwidth contention, leading to significant performance degradation in other workloads. While many existing prefetchers rely on static or simple built-in aggressiveness controllers, a more flexible, adaptive approach based on system-level feedback is essential to achieving optimal performance across parallel computing environments. In this paper, we introduce an Adaptive Prefetch Aggressiveness Control (APAC) framework that leverages Reinforcement Learning (RL) to dynamically manage prefetch aggressiveness in parallel system architectures. The APAC controller operates as an RL agent, which optimizes prefetch aggressiveness by dynamically responding to system feedback on prefetch accuracy, timeliness, and cache pollution. The agent receives a reward signal that reflects the impact of each adjustment on both performance and memory bandwidth, learning to adapt its control strategy based on workload characteristics. This data-driven adaptability makes APAC particularly well-suited for parallel architectures, where efficient resource management across cores is essential to scaling system performance. Our evaluation with the ChampSim simulator demonstrates that APAC effectively adapts to diverse workloads and system configurations, achieving performance gains of 6.73<inline-formula><tex-math>$%$</tex-math></inline-formula> in multi-core systems compared to traditional Feedback Directed Prefetching (FDP). By improving memory bandwidth utilization, reducing cache pollution, and minimizing inter-core interference, APAC significantly enhances prefetching performance in multi-core processors. These results underscore APAC’s potential as a robust solution for performance optimization in parallel system architectures, where efficient resource management is paramount for scaling modern processing environments.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"977-993"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10923695","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CiMBA: Accelerating Genome Sequencing Through On-Device Basecalling via Compute-in-Memory","authors":"William Andrew Simon;Irem Boybat;Riselda Kodra;Elena Ferro;Gagandeep Singh;Mohammed Alser;Shubham Jain;Hsinyu Tsai;Geoffrey W. Burr;Onur Mutlu;Abu Sebastian","doi":"10.1109/TPDS.2025.3550811","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3550811","url":null,"abstract":"As genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, consuming >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded (<inline-formula><tex-math>$sim 25$</tex-math></inline-formula> mm<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24× that required for real-time operation, and achieves 17 × /27× power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1130-1145"},"PeriodicalIF":5.6,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graphite: Hardware-Aware GNN Reshaping for Acceleration With GPU Tensor Cores","authors":"Hyeonjin Kim;Taesoo Lim;William J. Song","doi":"10.1109/TPDS.2025.3549180","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3549180","url":null,"abstract":"Graph neural networks (GNNs) have emerged as powerful tools for addressing non-euclidean problems. GNNs operate through two key execution phases: i) aggregation and ii) combination. In the aggregation phase, the feature data of neighboring graph nodes are gathered, which is expressed as sparse-dense matrix multiplication (SpMM) between an adjacency matrix and a feature embedding table. The combination phase takes the aggregated feature embedding as input to a neural network model with learnable weights. Typically, the adjacency matrix is extremely sparse due to inherent graph structures, making the aggregation phase a significant bottleneck in GNN computations. This paper introduces <italic>Graphite</i>, a GNN acceleration framework to overcome the challenge of SpMM operations and enable graphics processing units (GPUs) to exploit massive thread-level parallelism more efficiently via existing dense acceleration units (i.e., tensor cores). To that end, Graphite employs three techniques for GNN acceleration. First, <italic>hardware-aware sparse graph reshaping (HAS)</i> rearranges graph structures to replace sparse operations with dense computations, enabling hardware acceleration through GPU tensor cores. Additionally, <italic>balanced thread block scheduling (BTS)</i> distributes sparse thread blocks evenly across streaming multiprocessors in GPUs, and <italic>zero-aware warp skipping (ZAWS)</i> eliminates ineffective threads that operate on meaningless zeros. Experimental results show that Graphite achieves an average compression rate of 84.1% for adjacency matrices using HAS. Combined with BTS and ZAWS, Graphite delivers an average 1.55x speedup over the conventional SpMM-based GNN computation method.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"918-931"},"PeriodicalIF":5.6,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedLoRE: Communication-Efficient and Personalized Edge Intelligence Framework via Federated Low-Rank Estimation","authors":"Zerui Shao;Beibei Li;Peiran Wang;Yi Zhang;Kim-Kwang Raymond Choo","doi":"10.1109/TPDS.2025.3548444","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3548444","url":null,"abstract":"Federated learning (FL) has recently garnered significant attention in edge intelligence. However, FL faces two major challenges: First, statistical heterogeneity can adversely impact the performance of the global model on each client. Second, the model transmission between server and clients leads to substantial communication overhead. Previous works often suffer from the trade-off issue between these seemingly competing goals, yet we show that it is possible to address both challenges simultaneously. We propose a novel communication-efficient personalized FL framework for edge intelligence that estimates the low-rank component of the training model gradient and stores the residual component at each client. The low-rank components obtained across communication rounds have high similarity, and sharing these components with the server can significantly reduce communication overhead. Specifically, we highlight the importance of previously neglected residual components in tackling statistical heterogeneity, and retaining them locally for training model updates can effectively improve the personalization performance. Moreover, we provide a theoretical analysis of the convergence guarantee of our framework. Extensive experimental results demonstrate that our framework outperforms state-of-the-art approaches, achieving up to 89.18% reduction in communication overhead and 91.00% reduction in computation overhead while maintaining comparable personalization accuracy compared to previous works.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"994-1010"},"PeriodicalIF":5.6,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}