{"title":"Courier: A Unified Communication Agent to Support Concurrent Flow Scheduling in Cluster Computing","authors":"Zhaochen Zhang;Xu Zhang;Zhaoxiang Bao;Liang Wei;Chaohong Tan;Wanchun Dou;Guihai Chen;Chen Tian","doi":"10.1109/TPDS.2025.3543882","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3543882","url":null,"abstract":"As one of the pillars in cluster computing frameworks, coflow scheduling algorithms can effectively shorten the network transmission time of cluster computing jobs, thus reducing the job completion times and improving the execution performance. However, most of existing coflow scheduling algorithms failed to consider the influences of concurrent flows, which can degrade their performance under a massive number of concurrent flows. To fill the gap, we propose a unified communication agent named Courier to minimize the number of concurrent flows in cluster computing applications, which is compatible with the mainstream coflow scheduling approaches. To maintain the scheduling order given by the scheduling algorithms, Courier merges multiple flows between each pair of hosts into a unified flow, and determines its order based on that of origin flows. In addition, in order to adapt to various types of topologies, Courier introduces a control mechanism to adjust the number of flows while maintaining the scheduling order. Extensive large-scale trace-driven simulations have shown that Courier is compatible with existing scheduling algorithms, and outperforms the state-of-the-art approaches by about 30% under a variety of workloads and topologies.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"861-876"},"PeriodicalIF":5.6,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143821656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spread+: Scalable Model Aggregation in Federated Learning With Non-IID Data","authors":"Huanghuang Liang;Xin Yang;Xiaoming Han;Boan Liu;Chuang Hu;Dan Wang;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TPDS.2025.3539738","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3539738","url":null,"abstract":"Federated learning (FL) addresses privacy concerns by training models without sharing raw data, overcoming the limitations of traditional machine learning paradigms. However, the rise of smart applications has accentuated the heterogeneity in data and devices, which presents significant challenges for FL. In particular, data skewness among participants can compromise model accuracy, while diverse device capabilities lead to aggregation bottlenecks, causing severe model congestion. In this article, we introduce Spread+, a hierarchical system that enhances FL by organizing clients into clusters and delegating model aggregation to edge devices, thus mitigating these challenges. Spread+ leverages hedonic coalition formation game to optimize customer organization and adaptive algorithms to regulate aggregation intervals within and across clusters. Moreover, it refines the aggregation algorithm to boost model accuracy. Our experiments demonstrate that Spread+ significantly alleviates the central aggregation bottleneck and surpasses mainstream benchmarks, achieving performance improvements of 49.58% over FAVG and 22.78% over Ring-allreduce.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 4","pages":"701-716"},"PeriodicalIF":5.6,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143535516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Libfork: Portable Continuation-Stealing With Stackless Coroutines","authors":"Conor J. Williams;James Elliott","doi":"10.1109/TPDS.2025.3543442","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3543442","url":null,"abstract":"Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time-scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation-stealing in traditional High Performance Computing (HPC) languages – where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless-coroutines (a new feature in C<b>++</b><inline-formula><tex-math>$bm {20}$</tex-math></inline-formula>) can enable fully-portable continuation stealing and present <i>libfork</i> a wait-free fine-grained parallelism library, combining coroutines with user-space, geometric segmented-stacks. We show our approach is able to achieve optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to openMP (libomp), libfork is on average <inline-formula><tex-math>$7.2times$</tex-math></inline-formula> faster and consumes <inline-formula><tex-math>$10times$</tex-math></inline-formula> less memory. Similarly, compared to Intel's TBB, libfork is on average <inline-formula><tex-math>$2.7times$</tex-math></inline-formula> faster and consumes <inline-formula><tex-math>$6.2times$</tex-math></inline-formula> less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching <i>busy-waiting</i> schedulers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"877-888"},"PeriodicalIF":5.6,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedTune-SGM: A Stackelberg-Driven Personalized Federated Learning Strategy for Edge Networks","authors":"Neha Singh;Mainak Adhikari","doi":"10.1109/TPDS.2025.3543368","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3543368","url":null,"abstract":"Federated Learning (FL) has emerged as a prominent solution for distributed learning environments, enabling collaborative model training without centralized data collection. However, FL faces significant challenges such as data heterogeneity and resource-constraint edge devices for model training and analysis, leading to accuracy degradation and bias in model performance. To address these critical issues, we propose a novel FL strategy named FedTune-SGM, designed to optimize model training in decentralized settings. In this strategy, a cloud-based model is initially trained and fine-tuned on the edge devices with additional layers tailored to the specific data characteristics. This fine-tuning process effectively mitigates the impact of data heterogeneity, enhancing the robustness and generalization capability of the model. FedTune-SGM employs a strategic weighting mechanism that ensures a balanced and equitable contribution from participating edge devices to prevent dominant influences from resource-rich devices and promote a fairer and more accurate aggregated model. Additionally, the proposed strategy integrates a Stackelberg Game model to foster an interactive and dynamic cloud-edge setup that motivates edge devices to invest more effort in model training and ensures the effectiveness of resource-constraint edge devices. Extensive experiments conducted on three diverse datasets highlight the superior performance of the proposed FedTune-SGM strategy compared to state-of-the-art FL techniques in terms of accuracy and robustness while meeting the critical challenges of data heterogeneity and resource limitations in FL environments. Through these innovations, FedTune-SGM paves the way for more reliable and efficient distributed learning systems, unlocking the full potential of FL in practical applications.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 4","pages":"791-802"},"PeriodicalIF":5.6,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143553185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Tail Latency SLO Guaranteed Task Scheduling Scheme for User-Facing Services","authors":"Zhijun Wang;Huiyang Li;Lin Sun;Stoddard Rosenkrantz;Hao Che;Hong Jiang","doi":"10.1109/TPDS.2025.3542638","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3542638","url":null,"abstract":"A primary design objective for user-facing services for cloud and edge computing is to maximize query throughput, while meeting query tail latency Service Level Objectives (SLOs) for individual queries. Unfortunately, the existing solutions fall short of achieving this design objective, which we argue, is largely attributed to the fact that they fail to take the query fanout explicitly into account. In this paper, we propose TailGuard based on a Tail-latency-SLO-and-Fanout-aware Earliest-Deadline-First Queuing policy (TF-EDFQ) for task queuing at individual task servers the query tasks are fanned out to. With the task pre-dequeuing time deadline for each task being derived based on both query tail latency SLO and query fanout, TailGuard takes an important first step towards achieving the design objective. A query admission control scheme is also developed to provide tail latency SLO guarantee in the presence of resource shortages. TailGuard is evaluated against First-In-First-Out (FIFO) task queuing, task PRIority Queuing (PRIQ) and Tail-latency-SLO-aware EDFQ (T-EDFQ) policies by both simulation and testing in the Amazon EC2 cloud. It is driven by three types of applications in the Tailbench benchmark suite, featuring web search, in-memory key-value store, and transactional database applications. The results demonstrate that TailGuard can significantly improve resource utilization (e.g., up to 80% compared to FIFO), while also meeting the targeted tail latency SLOs, as compared with the other three policies. TailGuard is also implemented and tested in a highly heterogeneous Sensing-<inline-formula><tex-math>$a$</tex-math></inline-formula>s-a-Service (SaS) testbed for a data sensing service, demonstrating performance gains of up to 33% . These results are consistent with both the simulation and Amazon EC2 results.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 4","pages":"759-774"},"PeriodicalIF":5.6,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143553350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beelog: Online Log Compaction for Dependable Systems","authors":"Luiz Gustavo C. Xavier;Cristina Meinhardt;Odorico Machado Mendizabal","doi":"10.1109/TPDS.2025.3541628","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3541628","url":null,"abstract":"Logs are a known abstraction used to develop dependable and secure distributed systems. By logging entries on a sequential global log, systems can synchronize updates over replicas and provide a consistent state recovery in the presence of faults. However, their usage incurs a non-negligible overhead on the application's performance. This article presents Beelog, an approach to reduce logging impact and accelerate recovery on log-based protocols by safely discarding entries from logs. The technique involves executing a log compaction during run-time concurrently with the persistence and execution of commands. Besides compacting logging information, the proposed technique splits the log file and incorporates strategies to reduce logging overhead, such as batching and parallel I/O. We evaluate the proposed approach by implementing it as a new feature of the etcd key-value store and comparing it against etcd's standard logging. Utilizing workloads from the YCSB benchmark and experimenting with different configurations for batch size and number of storage devices, our results indicate that Beelog can reduce application recovery time, especially in write-intensive workloads with a small number of keys and a probability favoring the most recent keys to be updated. In such scenarios, we observed up to a 50% compaction in the log file size and a 65% improvement in recovery time compared to etcd's standard recovery protocol. As a side effect, batching results in higher command execution latency, ranging from <inline-formula><tex-math>$ text{100 ms}$</tex-math></inline-formula> to <inline-formula><tex-math>$ text{350 ms}$</tex-math></inline-formula> with Beelog, compared to the default etcd's <inline-formula><tex-math>$ text{90 ms}$</tex-math></inline-formula>. Except for the latency increase, the proposed technique does not impose other significant performance costs, making it a practical solution for systems where fast recovery and reduced storage are priorities.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 4","pages":"689-700"},"PeriodicalIF":5.6,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143521353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy Efficient and Multi-Resource Optimization for Virtual Machine Placement by Improving MOEA/D","authors":"Wenting Wei;Huaxi Gu;Zhe Xiao;Yi Chen","doi":"10.1109/TPDS.2025.3538525","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3538525","url":null,"abstract":"The explosive growth of cloud services has led to the widespread construction of large-scale data centers to meet diverse and multifaceted cloud computing demands. However, this expansion has resulted in substantial energy consumption. Virtual machine placement (VMP) has been extensively studied as a means to provide flexible and scalable cloud services while optimizing energy efficiency. Yet, the increasing complexity and diversity of applications have posted VMP suffering from waste of resources and bottlenecks due to unbalanced utilization of multi-dimensional resources. To address these issues, this article proposes a bi-objective optimization model for VMP that jointly optimizes power consumption and multi-dimensional resource utilization. Solving this large-scale bi-objective model presents a significant challenge in balancing performance and computational complexity. To tackle this, an enhanced decomposition-based multi-objective evolutionary algorithm (MOEA/D) based on <inline-formula><tex-math>$varepsilon$</tex-math></inline-formula>-domination, termed <inline-formula><tex-math>$varepsilon$</tex-math></inline-formula>-IMOEA/D-M2M is designed to provide solutions for the proposed optimization. Compared with both heuristics and evolutionary algorithms, performance evaluations demonstrate that our proposed VMP algorithm effectively reduces power consumption and balances multidimensional resource utilization while significantly decreasing running time compared to both heuristic and traditional evolutionary algorithms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1087-1099"},"PeriodicalIF":5.6,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-Aware Service Placement for Distributed Learning in Wireless Edge Networks","authors":"Rong Cong;Zhiwei Zhao;Mengfan Wang;Geyong Min;Jiangshu Liu;Jiwei Mo","doi":"10.1109/TPDS.2025.3539620","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3539620","url":null,"abstract":"Machine learning has been a driving force in the evolution of tremendous computing services and applications in the past decade. Traditional learning systems rely on centralized training and inference, which poses serious privacy and security concerns. To solve this problem, distributed learning over wireless edge networks (DLWENs) emerges as a trending solution and has attracted increasing research interests. In DLWENs, corresponding services need to be placed onto the edge servers to process the distributed tasks. Apparently, different placement of training services can significantly affect the performance of all distributed learning tasks. In this article, we propose TASP, a task-aware service placement scheme for distributed learning in wireless edge networks. By carefully considering the structures (directed acyclic graphs) of the distributed learning tasks, the fine-grained task requests and inter-task dependencies are incorporated into the placement strategies to realize the parallel computation of learning services. We also exploit queuing theory to characterize the dynamics caused by task uncertainties. Extensive experiments based on the Alibaba ML dataset show that, compared to the state-of-the-art schemes, the proposed work reduces the overall delay of distributed learning tasks by 38.6% on average.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 4","pages":"731-744"},"PeriodicalIF":5.6,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143535414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EfficientMoE: Optimizing Mixture-of-Experts Model Training With Adaptive Load Balance","authors":"Yan Zeng;Chengchuang Huang;Yipeng Mei;Lifu Zhang;Teng Su;Wei Ye;Wenqi Shi;Shengnan Wang","doi":"10.1109/TPDS.2025.3539297","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3539297","url":null,"abstract":"Mixture-of-Experts (MoE) efficiently trains large models by using sparse activation to lower costs, selecting a few experts based on data characteristics. However, it faces challenges such as All-to-All communication overhead and load imbalance, with most optimizations targeting dynamic graphs rather than the more efficient static graphs. This study identifies two key challenges in training MoE on static graphs: 1) excessive All-to-All communication (up to 75% of iteration time) and load imbalance (70% of tokens handled by two experts) between experts due to the sparse structure of the MoE model and the token distribution; and 2) inefficient zero-padding for static shapes, leading to unnecessary computational overhead(wasting approximately 50% of resources). Thus, EfficientMoE, a scheduling method based on expert load and data characteristics, is introduced. EfficientMoE first designs a sampler to collect real-time information about token distribution, expert load, etc. It constructs a load prediction model to evaluate expert load. Subsequently, EfficientMoE proposes a dynamic schedule strategy for experts with evaluated expert load, reducing All-to-All communication and addressing load-balancing issues. Additionally, an expert capacity model is proposed to set different capacities for replicas of hot experts before static graph compilation, minimizing computation and storage overhead caused by significant padding. This study implements EfficientMoE in MindSpore and uses 32 Ascend AI accelerators to train an MoE model with 21 billion parameters and evaluate its validity. EfficientMoE demonstrated an improvement of 30% in model training time, approximately 12% reduction in communication time, and saved 35% computational resources across different clusters, compared with Switch transformers, and the Fastermoe method for static graphs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 4","pages":"677-688"},"PeriodicalIF":5.6,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143521442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flips: A Flexible Partitioning Strategy Near Memory Processing Architecture for Recommendation System","authors":"Yudi Qiu;Lingfei Lu;Shiyan Yi;Minge Jing;Xiaoyang Zeng;Yang Kong;Yibo Fan","doi":"10.1109/TPDS.2025.3539534","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3539534","url":null,"abstract":"Personalized recommendation systems are massively deployed in production data centers. The memory-intensive embedding layers of recommendation systems are the crucial performance bottleneck, with operations manifesting as sparse memory lookups and simple reduction computations. Recent studies propose near-memory processing (NMP) architectures to speed up embedding operations by utilizing high internal memory bandwidth. However, these solutions typically employ a fixed vector partitioning strategy that fail to adapt to changes in data center deployment scenarios and lack practicality. We propose Flips, a <underline>fl</u>ex<underline>i</u>ble <underline>p</u>artitioning <underline>s</u>trategy NMP architecture that accelerates embedding layers. Flips supports more than ten partitioning strategies through hardware-software co-design. Novel hardware architectures and address mapping schemes are designed for the memory-side and host-side. We provide two approaches to determine the optimal partitioning strategy for each embedding table, enabling the architecture to accommodate changes in deployment scenarios. Importantly, Flips is decoupled from the NMP level and can utilize rank-level, bank-group-level and bank-level parallelism. In peer-level NMP evaluations, Flips outperforms state-of-the-art NMP solutions, RecNMP, TRiM, and ReCross by up to 4.0×, 4.1×, and 3.5×, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 4","pages":"745-758"},"PeriodicalIF":5.6,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143535514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}