IEEE Transactions on Parallel and Distributed Systems: Latest Publications

Accelerating Communication-Efficient Federated Multi-Task Learning With Personalization and Fairness
IF 5.6 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-06-10 · DOI: 10.1109/TPDS.2024.3411815
Renyou Xie;Chaojie Li;Xiaojun Zhou;Zhaoyang Dong
{"title":"Accelerating Communication-Efficient Federated Multi-Task Learning With Personalization and Fairness","authors":"Renyou Xie;Chaojie Li;Xiaojun Zhou;Zhaoyang Dong","doi":"10.1109/TPDS.2024.3411815","DOIUrl":"10.1109/TPDS.2024.3411815","url":null,"abstract":"Federated learning techniques provide a promising framework for collaboratively training a machine learning model without sharing users’ data, and delivering a security solution to guarantee privacy during the model training of IoT devices. Nonetheless, challenges posed by data heterogeneity and communication resource constraints make it difficult to develop an efficient federated learning algorithm in terms of the low order of convergence rate. It could significantly deteriorate the quality of service for critical machine learning tasks, e.g., facial recognition, which requires an edge-ready, low-power, low-latency training algorithm. To address these challenges, a communication-efficient federated learning approach is proposed in this paper where the momentum technique is leveraged to accelerate the convergence rate while largely reducing the communication requirements. First, a federated multi-task learning framework by which the learning tasks are reformulated by the multi-objective optimization problem is introduced to address the data heterogeneity. The multiple gradient descent algorithm is harnessed to find the common gradient descending direction for all participants so that the common features can be learned and no sacrifice on each clients’ performance. Second, to reduce communication costs, a local momentum technique with global information is developed to speed up the convergence rate, where the convergence analysis of the proposed method under non-convex case is studied. It is proved that the proposed local momentum can actually achieve the same acceleration as the global momentum, whereas it is more robust than algorithms that solely rely on the acceleration by the global momentum. Third, the generalization of the proposed acceleration approach is investigated which is demonstrated by the accelerated variation of FedAvg. Finally, the performance of the proposed method on the learning model accuracy, convergence rate, and robustness to data heterogeneity, is investigated by empirical experiments on four public datasets, while a real-world IoT platform is constructed to demonstrate the communication efficiency of the proposed method.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2239-2253"},"PeriodicalIF":5.6,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
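The local-momentum idea from the abstract can be pictured with a short sketch: each client runs momentum SGD seeded with the globally aggregated momentum, and the server averages both models and momentum buffers. This is a minimal sketch under assumed update rules; the function names, the FedAvg-style aggregation, and the seeding scheme are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def client_update(w_global, m_global, grad_fn, steps=5, lr=0.1, beta=0.9):
    """Local momentum SGD seeded with global momentum information
    (a sketch; the paper's exact update rule may differ)."""
    w, m = w_global.copy(), m_global.copy()
    for _ in range(steps):
        g = grad_fn(w)        # stochastic gradient on this client's data
        m = beta * m + g      # momentum accumulation
        w = w - lr * m        # descent step
    return w, m

def server_aggregate(results):
    """FedAvg-style averaging of client models and momentum buffers."""
    ws, ms = zip(*results)
    return np.mean(ws, axis=0), np.mean(ms, axis=0)
```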
KLNK: Expanding Page Boundaries in a Distributed Shared Memory System
IF 5.6 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-06-05 · DOI: 10.1109/TPDS.2024.3409882
Yi-Wei Ci;Michael R. Lyu;Zhan Zhang;De-Cheng Zuo;Xiao-Zong Yang
{"title":"KLNK: Expanding Page Boundaries in a Distributed Shared Memory System","authors":"Yi-Wei Ci;Michael R. Lyu;Zhan Zhang;De-Cheng Zuo;Xiao-Zong Yang","doi":"10.1109/TPDS.2024.3409882","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3409882","url":null,"abstract":"Software-based distributed shared memory (DSM) allows multiple processes to access shared data without the need for specialized hardware. However, this flexibility comes at a significant cost due to the need for data synchronization. One approach to mitigate these costs is to relax the consistency model, which can lead to delayed updates to the shared data. This approach typically requires the use of explicit synchronization primitives to regulate access to the shared memory and determine the timing of data synchronization. To circumvent the need for explicit synchronization, an alternative approach is to manage shared memory transparently using the underlying system. While this can simplify programming, it often imposes a fixed granularity for data sharing, which can limit the expansion of the coherence domain and increase the synchronization requirements. To overcome this limitation, we propose an abstraction called the elastic coherence domain, which dynamically adjusts the scope of data synchronization and is supported by the underlying system for transparent management of shared memory. The experimental results show that this approach can improve the efficiency of memory sharing in distributed environments.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1524-1535"},"PeriodicalIF":5.6,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141725570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
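As rough intuition for the elastic coherence domain, the sketch below merges pages that are observed to be accessed together into a single domain, so one synchronization covers the whole group. The union-find representation and the merge trigger are assumptions chosen for illustration, not KLNK's actual protocol.

```python
class ElasticDomains:
    """Toy model: pages accessed together are merged into one coherence
    domain and synchronized as a unit (illustrative only)."""
    def __init__(self, n_pages):
        self.parent = list(range(n_pages))

    def find(self, p):
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]  # path halving
            p = self.parent[p]
        return p

    def merge(self, p, q):
        """Expand the domain: pages p and q now synchronize together."""
        self.parent[self.find(p)] = self.find(q)

    def domain(self, p):
        """All pages that a synchronization of p would cover."""
        root = self.find(p)
        return [q for q in range(len(self.parent)) if self.find(q) == root]
```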
FEUAGame: Fairness-Aware Edge User Allocation for App Vendors
IF 5.6 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-06-04 · DOI: 10.1109/TPDS.2024.3409548
Jingwen Zhou;Feifei Chen;Guangming Cui;Yong Xiang;Qiang He
{"title":"FEUAGame: Fairness-Aware Edge User Allocation for App Vendors","authors":"Jingwen Zhou;Feifei Chen;Guangming Cui;Yong Xiang;Qiang He","doi":"10.1109/TPDS.2024.3409548","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3409548","url":null,"abstract":"Mobile edge computing (MEC) offers a new computing paradigm that turns computing and storage resources to the network edge to provide minimal service latency compared to cloud computing. Many research works have attempted to help app vendors allocate users to appropriate edge servers for high-performance service provisioning. However, existing edge user allocation (EUA) approaches have ignored fairness in users’ data rates caused by interference, which is crucial in service provisioning in the MEC environment. To pursue fairness in EUA, edge users need to be assigned to edge servers so their quality of experience can be ensured at minimum costs without significant service performance differences among them. In this paper, we make the first attempt to address this fair edge user allocation (FEUA) problem. Specifically, we formulate the FEUA problem, prove its \u0000<inline-formula><tex-math>$mathcal {NP}$</tex-math></inline-formula>\u0000-hardness, and propose an optimal approach to solve small-scale FEUA problems. To accommodate large-scale FEUA scenarios, we propose a game-theoretic approach called FEUAGame that transforms the FEUA problem into a potential game that admits a Nash equilibrium. FEUA employs a decentralized algorithm to find the Nash equilibrium in the potential game as the solution to the FEUA problem. A widely-used real-world data set is utilised to experimentally compare the performance of FEUAGame to four representative approaches. The numerical outcomes show the effectiveness and efficiency of the proposed approaches in solving the FEUA problem.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1429-1443"},"PeriodicalIF":5.6,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141448039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
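Because the problem is cast as a potential game, a decentralized better-response loop is guaranteed to reach a Nash equilibrium: every unilateral improvement strictly decreases the finite potential function. The sketch below shows that generic loop; the cost function, which in FEUAGame would encode interference and fairness, is left abstract, and its signature is an assumption.

```python
def best_response_dynamics(users, servers, cost):
    """cost(user, server, alloc) -> the user's cost if it used `server`
    under allocation `alloc`. Iterate unilateral improvements; in a
    finite potential game this terminates at a Nash equilibrium."""
    alloc = {u: servers[0] for u in users}      # arbitrary initial allocation
    improved = True
    while improved:
        improved = False
        for u in users:
            def trial(s):
                t = dict(alloc)                 # hypothetical switch of u to s
                t[u] = s
                return cost(u, s, t)
            best = min(servers, key=trial)
            if trial(best) < trial(alloc[u]):   # strict improvement only
                alloc[u] = best
                improved = True
    return alloc
```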
WASP: Efficient Power Management Enabling Workload-Aware, Self-Powered AIoT Devices
IF 5.3 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-06-03 · DOI: 10.1109/TPDS.2024.3408167
Xiaofeng Hou;Xuehan Tang;Jiacheng Liu;Chao Li;Luhong Liang;Kwang-Ting Cheng
{"title":"WASP: Efficient Power Management Enabling Workload-Aware, Self-Powered AIoT Devices","authors":"Xiaofeng Hou;Xuehan Tang;Jiacheng Liu;Chao Li;Luhong Liang;Kwang-Ting Cheng","doi":"10.1109/TPDS.2024.3408167","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3408167","url":null,"abstract":"The wide adoption of edge AI has heightened the demand for various battery-less and maintenance-free smart systems. Nevertheless, emerging Artificial Intelligence of Things (AIoT) are complex workloads showing increased power demand, diversified power usage patterns, and unique sensitivity to power management (PM) approaches. Existing AIoT devices cannot select the most appropriate PM tuning knob, and therefore they often make sub-optimal decisions. In addition, these PM solutions always assume traditional power regulation circuit which incurs non-negligible power loss and control overhead. This can greatly compromise the potential of AIoT efficiency. In this paper, we explore power management (PM) optimization for emerging self-powered AIoT devices. We propose WASP, a highly efficient power management scheme for workload-aware, self-powered AIoT devices. The novelty of WASP is two fold. First, it combines offline profiling and light-weight online control to select the most appropriate PM tuning knobs for the given DNN models. Second, it is well tailored to a reconfigurable voltage regulation module that can make the best use of the limited power budget. Our results show that WASP allows AIoT devices to accomplish 65.6% more inference tasks under a stringent power budget without any performance degradation compared with other existing approaches.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1400-1414"},"PeriodicalIF":5.3,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141333940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
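The offline-profile-plus-online-control split can be pictured as a small lookup step: given a table of (model, knob) measurements built offline, the online controller picks the highest-throughput knob that fits the currently harvested power budget. The table layout, knob names, and numbers below are hypothetical, a sketch of the idea rather than WASP's implementation.

```python
def select_knob(profile, model, power_budget_mw):
    """profile: {(model, knob): (power_mw, throughput)} built offline.
    Return the knob with the best profiled throughput that fits the
    instantaneous power budget, or None to power-gate and wait."""
    feasible = [(knob, tput)
                for (m, knob), (power, tput) in profile.items()
                if m == model and power <= power_budget_mw]
    if not feasible:
        return None
    return max(feasible, key=lambda kt: kt[1])[0]

# Hypothetical usage with a two-knob profile for one DNN:
profile = {("resnet18", "dvfs_low"):  (120.0,  9.5),
           ("resnet18", "dvfs_high"): (300.0, 21.0)}
assert select_knob(profile, "resnet18", 150.0) == "dvfs_low"
```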
HiHGNN: Accelerating HGNNs Through Parallelism and Data Reusability Exploitation
IF 5.3 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-04-30 · DOI: 10.1109/TPDS.2024.3394841
Runzhen Xue;Dengke Han;Mingyu Yan;Mo Zou;Xiaocheng Yang;Duo Wang;Wenming Li;Zhimin Tang;John Kim;Xiaochun Ye;Dongrui Fan
{"title":"HiHGNN: Accelerating HGNNs Through Parallelism and Data Reusability Exploitation","authors":"Runzhen Xue;Dengke Han;Mingyu Yan;Mo Zou;Xiaocheng Yang;Duo Wang;Wenming Li;Zhimin Tang;John Kim;Xiaochun Ye;Dongrui Fan","doi":"10.1109/TPDS.2024.3394841","DOIUrl":"10.1109/TPDS.2024.3394841","url":null,"abstract":"Heterogeneous graph neural networks (HGNNs) have emerged as powerful algorithms for processing heterogeneous graphs (HetGs), widely used in many critical fields. To capture both structural and semantic information in HetGs, HGNNs first aggregate the neighboring feature vectors for each vertex in each semantic graph and then fuse the aggregated results across all semantic graphs for each vertex. Unfortunately, existing graph neural network accelerators are ill-suited to accelerate HGNNs. This is because they fail to efficiently tackle the specific execution patterns and exploit the high-degree parallelism as well as data reusability inside and across the processing of semantic graphs in HGNNs. In this work, we first quantitatively characterize a set of representative HGNN models on GPU to disclose the execution bound of each stage, inter-semantic-graph parallelism, and inter-semantic-graph data reusability in HGNNs. Guided by our findings, we propose a high-performance HGNN accelerator, HiHGNN, to alleviate the execution bound and exploit the newfound parallelism and data reusability in HGNNs. Specifically, we first propose a bound-aware stage-fusion methodology that tailors to HGNN acceleration, to fuse and pipeline the execution stages being aware of their execution bounds. Second, we design an independency-aware parallel execution design to exploit the inter-semantic-graph parallelism. Finally, we present a similarity-aware execution scheduling to exploit the inter-semantic-graph data reusability. Compared to the state-of-the-art software framework running on NVIDIA GPU T4 and GPU A100, HiHGNN respectively achieves an average 40.0× and 8.3× speedup as well as 99.59% and 99.74% energy reduction with quintile the memory bandwidth of GPU A100.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 7","pages":"1122-1138"},"PeriodicalIF":5.3,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
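The two HGNN stages that HiHGNN fuses and parallelizes look roughly like this in dense form: per-semantic-graph neighbor aggregation (independent across graphs, hence the inter-semantic-graph parallelism) followed by per-vertex fusion. The weighted-sum fusion is a simplifying assumption; real HGNNs typically use attention, and accelerators operate on sparse formats.

```python
import numpy as np

def hgnn_layer(semantic_adjs, feats, fuse_weights):
    """semantic_adjs: one [V, V] adjacency per semantic graph;
    feats: [V, F] vertex features; fuse_weights: one scalar per graph.
    Dense numpy stand-in for the sparse kernels an accelerator runs."""
    # Stage 1: neighbor aggregation inside each semantic graph.
    # These aggregations are mutually independent, which is the
    # parallelism (and, for similar graphs, the reuse) HiHGNN exploits.
    aggregated = [adj @ feats for adj in semantic_adjs]
    # Stage 2: fuse the per-semantic-graph results for every vertex.
    return sum(w * h for w, h in zip(fuse_weights, aggregated))
```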
TeGraph+: Scalable Temporal Graph Processing Enabling Flexible Edge Modifications
IF 5.6 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-04-26 · DOI: 10.1109/TPDS.2024.3393914
Chengying Huan;Yongchao Liu;Heng Zhang;Hang Liu;Shiyang Chen;Shuaiwen Leon Song;Yanjun Wu
{"title":"TeGraph+: Scalable Temporal Graph Processing Enabling Flexible Edge Modifications","authors":"Chengying Huan;Yongchao Liu;Heng Zhang;Hang Liu;Shiyang Chen;Shuaiwen Leon Song;Yanjun Wu","doi":"10.1109/TPDS.2024.3393914","DOIUrl":"10.1109/TPDS.2024.3393914","url":null,"abstract":"Temporal graphs are widely used for time-critical applications, which enable the extraction of graph structural information with temporal features but cannot be efficiently supported by static graph computing systems. However, the current state-of-the-art solutions for temporal graph problems are not only ad-hoc and suboptimal, but they also exhibit poor scalability, particularly in terms of their inability to scale to evolving graphs with flexible edge modifications (including insertions and deletions) and diverse execution environments. In this article, we present two key observations. First, temporal path problems can be characterized as \u0000<i>topological-optimum</i>\u0000 problems, which can be efficiently resolved using a universal single-scan execution model. Second, data redundancy in transformed temporal graphs can be mitigated by merging superfluous vertices. Building upon these fundamental insights, we propose TeGraph+, a versatile temporal graph computing engine that makes the following contributions: (1) a unified optimization strategy and execution model for temporal graph problems; (2) a novel graph transformation model with graph redundancy reduction strategy; (3) a spanning tree decomposition (STD) based distributed execution model which uses an efficient transformed graph decomposition strategy to partition the transformed graph into different spanning trees for distributed execution; (4) an efficient mixed imperative and lazy graph update strategy that offers support for evolving graphs with flexible edge modifications; (5) a general system framework with user-friendly APIs and the support of various execution environments, including in-memory, out-of-core, and distributed execution environments. Our extensive evaluation reveals that TeGraph+ can achieve up to \u0000<inline-formula><tex-math>$241times$</tex-math></inline-formula>\u0000 speedups over the state-of-the-art counterparts.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1469-1487"},"PeriodicalIF":5.6,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
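A concrete instance of the "topological-optimum, single-scan" observation is the classic earliest-arrival problem: with edges sorted by departure time, one pass over the edge list suffices, because each edge's usefulness is fully decided by earlier edges. The sketch below illustrates that property on a plain edge list; TeGraph+'s graph transformation and distribution machinery is not shown.

```python
import math

def earliest_arrival(edges, source, t_start=0):
    """edges: (u, v, depart, arrive) tuples pre-sorted by `depart`.
    A single scan computes the earliest arrival time at every vertex."""
    arrival = {source: t_start}
    for u, v, depart, arrive in edges:
        if arrival.get(u, math.inf) <= depart:   # u reachable before departure
            if arrive < arrival.get(v, math.inf):
                arrival[v] = arrive
    return arrival

# Tiny example: a->b departs at t=1, b->c departs at t=3.
print(earliest_arrival([("a", "b", 1, 2), ("b", "c", 3, 4)], "a"))
# {'a': 0, 'b': 2, 'c': 4}
```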
SLO-Aware Function Placement for Serverless Workflows With Layer-Wise Memory Sharing
IF 5.3 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-04-22 · DOI: 10.1109/TPDS.2024.3391858
Dazhao Cheng;Kai Yan;Xinquan Cai;Yili Gong;Chuang Hu
{"title":"SLO-Aware Function Placement for Serverless Workflows With Layer-Wise Memory Sharing","authors":"Dazhao Cheng;Kai Yan;Xinquan Cai;Yili Gong;Chuang Hu","doi":"10.1109/TPDS.2024.3391858","DOIUrl":"10.1109/TPDS.2024.3391858","url":null,"abstract":"Function-as-a-Service (FaaS) is a promising cloud computing model known for its scalability and elasticity. In various application domains, FaaS workflows have been widely adopted to manage user requests and complete computational tasks efficiently. Motivated by the fact that function containers collaboratively use the image layer's memory, co-placing functions would leverage memory sharing to reduce cluster memory footprint, this article studies layer-wise memory sharing for serverless functions. We find that overwhelming memory sharing by placing containers in the same cluster machine may lead to performance deterioration and Service Level Objective (SLO) violations due to the increased CPU pressure. We investigate how to maximally reduce cluster memory footprint via layer-wise memory sharing for serverless workflows while guaranteeing their SLO. First, we study the container memory sharing problem under serverless workflows with a static Directed Acyclic Graph (DAG) structure. We prove it is NP-Hard and propose a 2-approximation algorithm, namely MDP. Then we consider workflows with dynamic DAG structure scenarios, where the memory sharing problem is also NP-Hard. We design a Greedy-based algorithm called GSP to address this issue. We implement a carefully designed prototype on the OpenWhisk platform, and our evaluation results demonstrate that both MDP and GSP achieve a balanced and satisfying state, effectively reducing up to 63\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000 of cache memory usage while guaranteeing serverless workflow SLO.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"919-936"},"PeriodicalIF":5.3,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140637295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
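The trade-off the paper optimizes can be sketched as a greedy placement that charges each function only for the image layers not already resident on a machine, with a crude per-machine load cap standing in for the SLO guard. This is a simplified stand-in for the spirit of GSP, not the paper's algorithm; all data structures and the cap are assumptions.

```python
def greedy_place(functions, machines, layer_size, load_cap):
    """functions: {name: set(layer_ids)}; layer_size: {layer_id: MB}.
    Place each function where the *additional* memory is smallest,
    since layers already present on a machine are shared, not copied."""
    resident = {m: set() for m in machines}   # layers cached per machine
    load = {m: 0 for m in machines}           # crude CPU-pressure proxy
    placement = {}
    for fn, layers in functions.items():
        candidates = [m for m in machines if load[m] < load_cap]
        extra = lambda m: sum(layer_size[l] for l in layers - resident[m])
        target = min(candidates, key=extra)
        placement[fn] = target
        resident[target] |= layers            # these layers are now shared
        load[target] += 1
    return placement
```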
Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction
IF 5.3 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-04-19 · DOI: 10.1109/TPDS.2024.3391254
Guoqing Xiao;Chuanghui Yin;Yuedan Chen;Mingxing Duan;Kenli Li
{"title":"Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction","authors":"Guoqing Xiao;Chuanghui Yin;Yuedan Chen;Mingxing Duan;Kenli Li","doi":"10.1109/TPDS.2024.3391254","DOIUrl":"10.1109/TPDS.2024.3391254","url":null,"abstract":"Many fields of scientific simulation, such as chemistry and condensed matter physics, are increasingly eschewing dense tensor contraction in favor of sparse tensor contraction. In this work, we center around binary sparse tensor contraction (SpTC) which has the challenges of index matching and accumulation. To address these difficulties, we present GSpTC, an efficient element-wise SpTC framework on CPU-GPU heterogeneous systems. GSpTC first introduces a fine-grained partitioning strategy based on element-wise tensor contraction. By analyzing and selecting appropriate dimension partitioning strategies, we can efficiently utilize the multi-threading parallelism on GPUs and optimize the overall performance of GSpTC. In particular, GSpTC leverages multi-threading parallelism on GPUs for the contraction phase and merging phase, which greatly accelerates the computation phase in sparse tensor contraction computations. Furthermore, GSpTC employs parallel pipeline technology to hide the data transmission time between the host and the device, further enhancing its performance. As a result, GSpTC achieves an average performance improvement of 267% compared to the previous state-of-the-art framework Sparta.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"889-900"},"PeriodicalIF":5.3,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140626457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
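The index-matching and accumulation pattern at the heart of element-wise SpTC is easiest to see in a sequential hash-map reference: bucket one tensor's nonzeros by their contracted indices, probe the buckets with the other tensor's nonzeros, multiply, and accumulate. This is a generic reference implementation of the pattern GSpTC partitions across GPU threads, not GSpTC's code.

```python
from collections import defaultdict

def sptc(A, B, contract_a, contract_b):
    """Element-wise sparse tensor contraction.
    A, B: {coordinate tuple: value}; contract_a/contract_b: positions
    of the contracted modes in each tensor's coordinate tuple."""
    # Bucket B's nonzeros by their contracted-index values.
    buckets = defaultdict(list)
    for coords, val in B.items():
        key = tuple(coords[i] for i in contract_b)
        free = tuple(c for i, c in enumerate(coords) if i not in contract_b)
        buckets[key].append((free, val))
    out = defaultdict(float)
    for coords, val in A.items():
        key = tuple(coords[i] for i in contract_a)          # index matching
        free_a = tuple(c for i, c in enumerate(coords) if i not in contract_a)
        for free_b, val_b in buckets.get(key, ()):
            out[free_a + free_b] += val * val_b             # accumulation
    return dict(out)

# 2x2 sparse matrix product as a special case: contract mode 1 of A with mode 0 of B.
print(sptc({(0, 1): 2.0}, {(1, 2): 3.0}, (1,), (0,)))  # {(0, 2): 6.0}
```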
Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems
IF 5.3 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-04-18 · DOI: 10.1109/TPDS.2024.3391058
Chen Wang;Kathryn Mohror;Marc Snir
{"title":"Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems","authors":"Chen Wang;Kathryn Mohror;Marc Snir","doi":"10.1109/TPDS.2024.3391058","DOIUrl":"10.1109/TPDS.2024.3391058","url":null,"abstract":"The semantics of HPC storage systems are defined by the consistency models to which they abide. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. The use of POSIX consistency imposes a performance penalty that becomes more significant as the scale of parallel file systems increases and the access time to storage devices, such as node-local solid storage devices, decreases. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of commonly seen parallel I/O workloads, such as checkpoint/restart of scientific applications and random reads of deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve the I/O performance. For instance, in small random reads that are typically found in deep learning applications, session consistency achieved a 5x improvement in I/O bandwidth compared to commit consistency, even at small scales.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"937-951"},"PeriodicalIF":5.3,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140626368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
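To make the session-consistency comparison concrete, here is a toy model in which a session's writes are published only at close() and observed by sessions opened afterwards; commit consistency would instead publish at explicit commit points. This is one plausible reading of the general definitions, assumed for illustration, not code or semantics taken from the paper.

```python
class SessionFile:
    """Toy session consistency for a single shared file."""
    shared = {}                               # durable contents: offset -> value

    def __init__(self):                       # open(): snapshot current contents
        self.view = dict(SessionFile.shared)
        self.dirty = {}

    def write(self, off, val):
        self.view[off] = val                  # visible inside this session
        self.dirty[off] = val

    def read(self, off):
        return self.view.get(off)

    def close(self):
        SessionFile.shared.update(self.dirty)  # publish at session end

s1 = SessionFile(); s1.write(0, "x")
s2 = SessionFile()                  # opened before s1 closed
assert s2.read(0) is None           # s1's write not yet visible
s1.close()
s3 = SessionFile()                  # opened after s1 closed
assert s3.read(0) == "x"            # now visible
```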
Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters
IF 5.3 · CAS Tier 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-04-17 · DOI: 10.1109/TPDS.2024.3390109
Kaiyang Liu;Jingrong Wang;Zhiming Huang;Jianping Pan
{"title":"Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters","authors":"Kaiyang Liu;Jingrong Wang;Zhiming Huang;Jianping Pan","doi":"10.1109/TPDS.2024.3390109","DOIUrl":"10.1109/TPDS.2024.3390109","url":null,"abstract":"Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler to solve the multi-job placement problem while taking into account job sizing and load balancing, minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low latency decision-making and improve scheduling fairness, we propose to construct the sparsification of feasible solution categories through sampling, which has negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"874-888"},"PeriodicalIF":5.3,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140612729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
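The proportional-workload idea is straightforward to sketch: give each heterogeneous worker a share of a step's samples proportional to its measured throughput, so workers finish at roughly the same time and the slowest worker stops dominating step time. The function and parameter names below are illustrative assumptions, not the paper's scheme.

```python
def proportional_split(total_samples, throughput):
    """throughput: {worker: samples/sec}. Assign per-worker batch sizes
    proportional to speed so all workers finish a step together."""
    total = sum(throughput.values())
    share = {w: int(total_samples * t / total) for w, t in throughput.items()}
    # Hand integer-rounding leftovers to the fastest workers.
    leftover = total_samples - sum(share.values())
    for w in sorted(throughput, key=throughput.get, reverse=True)[:leftover]:
        share[w] += 1
    return share

print(proportional_split(96, {"v100": 300.0, "t4": 100.0}))
# {'v100': 72, 't4': 24}
```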