{"title":"Accelerating Communication-Efficient Federated Multi-Task Learning With Personalization and Fairness","authors":"Renyou Xie;Chaojie Li;Xiaojun Zhou;Zhaoyang Dong","doi":"10.1109/TPDS.2024.3411815","DOIUrl":"10.1109/TPDS.2024.3411815","url":null,"abstract":"Federated learning techniques provide a promising framework for collaboratively training a machine learning model without sharing users’ data, and delivering a security solution to guarantee privacy during the model training of IoT devices. Nonetheless, challenges posed by data heterogeneity and communication resource constraints make it difficult to develop an efficient federated learning algorithm in terms of the low order of convergence rate. It could significantly deteriorate the quality of service for critical machine learning tasks, e.g., facial recognition, which requires an edge-ready, low-power, low-latency training algorithm. To address these challenges, a communication-efficient federated learning approach is proposed in this paper where the momentum technique is leveraged to accelerate the convergence rate while largely reducing the communication requirements. First, a federated multi-task learning framework by which the learning tasks are reformulated by the multi-objective optimization problem is introduced to address the data heterogeneity. The multiple gradient descent algorithm is harnessed to find the common gradient descending direction for all participants so that the common features can be learned and no sacrifice on each clients’ performance. Second, to reduce communication costs, a local momentum technique with global information is developed to speed up the convergence rate, where the convergence analysis of the proposed method under non-convex case is studied. It is proved that the proposed local momentum can actually achieve the same acceleration as the global momentum, whereas it is more robust than algorithms that solely rely on the acceleration by the global momentum. Third, the generalization of the proposed acceleration approach is investigated which is demonstrated by the accelerated variation of FedAvg. Finally, the performance of the proposed method on the learning model accuracy, convergence rate, and robustness to data heterogeneity, is investigated by empirical experiments on four public datasets, while a real-world IoT platform is constructed to demonstrate the communication efficiency of the proposed method.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2239-2253"},"PeriodicalIF":5.6,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KLNK: Expanding Page Boundaries in a Distributed Shared Memory System","authors":"Yi-Wei Ci;Michael R. Lyu;Zhan Zhang;De-Cheng Zuo;Xiao-Zong Yang","doi":"10.1109/TPDS.2024.3409882","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3409882","url":null,"abstract":"Software-based distributed shared memory (DSM) allows multiple processes to access shared data without the need for specialized hardware. However, this flexibility comes at a significant cost due to the need for data synchronization. One approach to mitigate these costs is to relax the consistency model, which can lead to delayed updates to the shared data. This approach typically requires the use of explicit synchronization primitives to regulate access to the shared memory and determine the timing of data synchronization. To circumvent the need for explicit synchronization, an alternative approach is to manage shared memory transparently using the underlying system. While this can simplify programming, it often imposes a fixed granularity for data sharing, which can limit the expansion of the coherence domain and increase the synchronization requirements. To overcome this limitation, we propose an abstraction called the elastic coherence domain, which dynamically adjusts the scope of data synchronization and is supported by the underlying system for transparent management of shared memory. The experimental results show that this approach can improve the efficiency of memory sharing in distributed environments.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1524-1535"},"PeriodicalIF":5.6,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141725570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FEUAGame: Fairness-Aware Edge User Allocation for App Vendors","authors":"Jingwen Zhou;Feifei Chen;Guangming Cui;Yong Xiang;Qiang He","doi":"10.1109/TPDS.2024.3409548","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3409548","url":null,"abstract":"Mobile edge computing (MEC) offers a new computing paradigm that turns computing and storage resources to the network edge to provide minimal service latency compared to cloud computing. Many research works have attempted to help app vendors allocate users to appropriate edge servers for high-performance service provisioning. However, existing edge user allocation (EUA) approaches have ignored fairness in users’ data rates caused by interference, which is crucial in service provisioning in the MEC environment. To pursue fairness in EUA, edge users need to be assigned to edge servers so their quality of experience can be ensured at minimum costs without significant service performance differences among them. In this paper, we make the first attempt to address this fair edge user allocation (FEUA) problem. Specifically, we formulate the FEUA problem, prove its \u0000<inline-formula><tex-math>$mathcal {NP}$</tex-math></inline-formula>\u0000-hardness, and propose an optimal approach to solve small-scale FEUA problems. To accommodate large-scale FEUA scenarios, we propose a game-theoretic approach called FEUAGame that transforms the FEUA problem into a potential game that admits a Nash equilibrium. FEUA employs a decentralized algorithm to find the Nash equilibrium in the potential game as the solution to the FEUA problem. A widely-used real-world data set is utilised to experimentally compare the performance of FEUAGame to four representative approaches. The numerical outcomes show the effectiveness and efficiency of the proposed approaches in solving the FEUA problem.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1429-1443"},"PeriodicalIF":5.6,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141448039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WASP: Efficient Power Management Enabling Workload-Aware, Self-Powered AIoT Devices","authors":"Xiaofeng Hou;Xuehan Tang;Jiacheng Liu;Chao Li;Luhong Liang;Kwang-Ting Cheng","doi":"10.1109/TPDS.2024.3408167","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3408167","url":null,"abstract":"The wide adoption of edge AI has heightened the demand for various battery-less and maintenance-free smart systems. Nevertheless, emerging Artificial Intelligence of Things (AIoT) are complex workloads showing increased power demand, diversified power usage patterns, and unique sensitivity to power management (PM) approaches. Existing AIoT devices cannot select the most appropriate PM tuning knob, and therefore they often make sub-optimal decisions. In addition, these PM solutions always assume traditional power regulation circuit which incurs non-negligible power loss and control overhead. This can greatly compromise the potential of AIoT efficiency. In this paper, we explore power management (PM) optimization for emerging self-powered AIoT devices. We propose WASP, a highly efficient power management scheme for workload-aware, self-powered AIoT devices. The novelty of WASP is two fold. First, it combines offline profiling and light-weight online control to select the most appropriate PM tuning knobs for the given DNN models. Second, it is well tailored to a reconfigurable voltage regulation module that can make the best use of the limited power budget. Our results show that WASP allows AIoT devices to accomplish 65.6% more inference tasks under a stringent power budget without any performance degradation compared with other existing approaches.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1400-1414"},"PeriodicalIF":5.3,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141333940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiHGNN: Accelerating HGNNs Through Parallelism and Data Reusability Exploitation","authors":"Runzhen Xue;Dengke Han;Mingyu Yan;Mo Zou;Xiaocheng Yang;Duo Wang;Wenming Li;Zhimin Tang;John Kim;Xiaochun Ye;Dongrui Fan","doi":"10.1109/TPDS.2024.3394841","DOIUrl":"10.1109/TPDS.2024.3394841","url":null,"abstract":"Heterogeneous graph neural networks (HGNNs) have emerged as powerful algorithms for processing heterogeneous graphs (HetGs), widely used in many critical fields. To capture both structural and semantic information in HetGs, HGNNs first aggregate the neighboring feature vectors for each vertex in each semantic graph and then fuse the aggregated results across all semantic graphs for each vertex. Unfortunately, existing graph neural network accelerators are ill-suited to accelerate HGNNs. This is because they fail to efficiently tackle the specific execution patterns and exploit the high-degree parallelism as well as data reusability inside and across the processing of semantic graphs in HGNNs. In this work, we first quantitatively characterize a set of representative HGNN models on GPU to disclose the execution bound of each stage, inter-semantic-graph parallelism, and inter-semantic-graph data reusability in HGNNs. Guided by our findings, we propose a high-performance HGNN accelerator, HiHGNN, to alleviate the execution bound and exploit the newfound parallelism and data reusability in HGNNs. Specifically, we first propose a bound-aware stage-fusion methodology that tailors to HGNN acceleration, to fuse and pipeline the execution stages being aware of their execution bounds. Second, we design an independency-aware parallel execution design to exploit the inter-semantic-graph parallelism. Finally, we present a similarity-aware execution scheduling to exploit the inter-semantic-graph data reusability. Compared to the state-of-the-art software framework running on NVIDIA GPU T4 and GPU A100, HiHGNN respectively achieves an average 40.0× and 8.3× speedup as well as 99.59% and 99.74% energy reduction with quintile the memory bandwidth of GPU A100.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 7","pages":"1122-1138"},"PeriodicalIF":5.3,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TeGraph+: Scalable Temporal Graph Processing Enabling Flexible Edge Modifications","authors":"Chengying Huan;Yongchao Liu;Heng Zhang;Hang Liu;Shiyang Chen;Shuaiwen Leon Song;Yanjun Wu","doi":"10.1109/TPDS.2024.3393914","DOIUrl":"10.1109/TPDS.2024.3393914","url":null,"abstract":"Temporal graphs are widely used for time-critical applications, which enable the extraction of graph structural information with temporal features but cannot be efficiently supported by static graph computing systems. However, the current state-of-the-art solutions for temporal graph problems are not only ad-hoc and suboptimal, but they also exhibit poor scalability, particularly in terms of their inability to scale to evolving graphs with flexible edge modifications (including insertions and deletions) and diverse execution environments. In this article, we present two key observations. First, temporal path problems can be characterized as \u0000<i>topological-optimum</i>\u0000 problems, which can be efficiently resolved using a universal single-scan execution model. Second, data redundancy in transformed temporal graphs can be mitigated by merging superfluous vertices. Building upon these fundamental insights, we propose TeGraph+, a versatile temporal graph computing engine that makes the following contributions: (1) a unified optimization strategy and execution model for temporal graph problems; (2) a novel graph transformation model with graph redundancy reduction strategy; (3) a spanning tree decomposition (STD) based distributed execution model which uses an efficient transformed graph decomposition strategy to partition the transformed graph into different spanning trees for distributed execution; (4) an efficient mixed imperative and lazy graph update strategy that offers support for evolving graphs with flexible edge modifications; (5) a general system framework with user-friendly APIs and the support of various execution environments, including in-memory, out-of-core, and distributed execution environments. Our extensive evaluation reveals that TeGraph+ can achieve up to \u0000<inline-formula><tex-math>$241times$</tex-math></inline-formula>\u0000 speedups over the state-of-the-art counterparts.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1469-1487"},"PeriodicalIF":5.6,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SLO-Aware Function Placement for Serverless Workflows With Layer-Wise Memory Sharing","authors":"Dazhao Cheng;Kai Yan;Xinquan Cai;Yili Gong;Chuang Hu","doi":"10.1109/TPDS.2024.3391858","DOIUrl":"10.1109/TPDS.2024.3391858","url":null,"abstract":"Function-as-a-Service (FaaS) is a promising cloud computing model known for its scalability and elasticity. In various application domains, FaaS workflows have been widely adopted to manage user requests and complete computational tasks efficiently. Motivated by the fact that function containers collaboratively use the image layer's memory, co-placing functions would leverage memory sharing to reduce cluster memory footprint, this article studies layer-wise memory sharing for serverless functions. We find that overwhelming memory sharing by placing containers in the same cluster machine may lead to performance deterioration and Service Level Objective (SLO) violations due to the increased CPU pressure. We investigate how to maximally reduce cluster memory footprint via layer-wise memory sharing for serverless workflows while guaranteeing their SLO. First, we study the container memory sharing problem under serverless workflows with a static Directed Acyclic Graph (DAG) structure. We prove it is NP-Hard and propose a 2-approximation algorithm, namely MDP. Then we consider workflows with dynamic DAG structure scenarios, where the memory sharing problem is also NP-Hard. We design a Greedy-based algorithm called GSP to address this issue. We implement a carefully designed prototype on the OpenWhisk platform, and our evaluation results demonstrate that both MDP and GSP achieve a balanced and satisfying state, effectively reducing up to 63\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000 of cache memory usage while guaranteeing serverless workflow SLO.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"919-936"},"PeriodicalIF":5.3,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140637295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction","authors":"Guoqing Xiao;Chuanghui Yin;Yuedan Chen;Mingxing Duan;Kenli Li","doi":"10.1109/TPDS.2024.3391254","DOIUrl":"10.1109/TPDS.2024.3391254","url":null,"abstract":"Many fields of scientific simulation, such as chemistry and condensed matter physics, are increasingly eschewing dense tensor contraction in favor of sparse tensor contraction. In this work, we center around binary sparse tensor contraction (SpTC) which has the challenges of index matching and accumulation. To address these difficulties, we present GSpTC, an efficient element-wise SpTC framework on CPU-GPU heterogeneous systems. GSpTC first introduces a fine-grained partitioning strategy based on element-wise tensor contraction. By analyzing and selecting appropriate dimension partitioning strategies, we can efficiently utilize the multi-threading parallelism on GPUs and optimize the overall performance of GSpTC. In particular, GSpTC leverages multi-threading parallelism on GPUs for the contraction phase and merging phase, which greatly accelerates the computation phase in sparse tensor contraction computations. Furthermore, GSpTC employs parallel pipeline technology to hide the data transmission time between the host and the device, further enhancing its performance. As a result, GSpTC achieves an average performance improvement of 267% compared to the previous state-of-the-art framework Sparta.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"889-900"},"PeriodicalIF":5.3,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140626457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems","authors":"Chen Wang;Kathryn Mohror;Marc Snir","doi":"10.1109/TPDS.2024.3391058","DOIUrl":"10.1109/TPDS.2024.3391058","url":null,"abstract":"The semantics of HPC storage systems are defined by the consistency models to which they abide. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. The use of POSIX consistency imposes a performance penalty that becomes more significant as the scale of parallel file systems increases and the access time to storage devices, such as node-local solid storage devices, decreases. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of commonly seen parallel I/O workloads, such as checkpoint/restart of scientific applications and random reads of deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve the I/O performance. For instance, in small random reads that are typically found in deep learning applications, session consistency achieved a 5x improvement in I/O bandwidth compared to commit consistency, even at small scales.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"937-951"},"PeriodicalIF":5.3,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140626368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters","authors":"Kaiyang Liu;Jingrong Wang;Zhiming Huang;Jianping Pan","doi":"10.1109/TPDS.2024.3390109","DOIUrl":"10.1109/TPDS.2024.3390109","url":null,"abstract":"Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler to solve the multi-job placement problem while taking into account job sizing and load balancing, minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low latency decision-making and improve scheduling fairness, we propose to construct the sparsification of feasible solution categories through sampling, which has negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"874-888"},"PeriodicalIF":5.3,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140612729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}