IEEE Transactions on Parallel and Distributed Systems: Latest Articles

Distributed Evolution Strategies With Multi-Level Learning for Large-Scale Black-Box Optimization
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 2087-2101. Pub Date: 2024-08-02. DOI: 10.1109/TPDS.2024.3437688
Qiqi Duan; Chang Shao; Guochen Zhou; Minghan Zhang; Qi Zhao; Yuhui Shi
Abstract: In the post-Moore era, the main performance gains of black-box optimizers increasingly depend on parallelism, especially for large-scale optimization (LSO). Here we propose to parallelize the well-established covariance matrix adaptation evolution strategy (CMA-ES) and in particular one of its latest LSO variants, called limited-memory CMA-ES (LM-CMA). To achieve efficiency while approximating its powerful invariance property, we present a multilevel learning-based meta-framework for distributed LM-CMA. Owing to its hierarchically organized structure, Meta-ES is well-suited to implement our distributed meta-framework, wherein the outer-ES controls strategy parameters while all parallel inner-ESs run the serial LM-CMA with different settings. For the distribution mean update of the outer-ES, both the elitist and multi-recombination strategy are used in parallel to avoid stagnation and regression, respectively. To exploit spatiotemporal information, the global step-size adaptation combines Meta-ES with the parallel cumulative step-size adaptation. After each isolation time, our meta-framework employs both the structure and parameter learning strategy to combine aligned evolution paths for CMA reconstruction. Experiments on a set of large-scale benchmarking functions with memory-intensive evaluations, arguably reflecting many data-driven optimization problems, validate the benefits (e.g., effectiveness w.r.t. solution quality, and adaptability w.r.t. second-order learning) and costs of our meta-framework.
Citations: 0
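As a rough illustration of the elitist idea mentioned in the abstract above, the following sketch runs a textbook (1+1) evolution strategy on a sphere function. It is not LM-CMA, Meta-ES, or the paper's meta-framework; the function names, the step-size constants (1.5 and 0.9), and the test function are all assumptions chosen for the example.

```python
import random

def sphere(x):
    """Simple convex test function; minimum 0 at the origin."""
    return sum(v * v for v in x)

def one_plus_one_es(f, x0, sigma=0.5, iters=200, seed=0):
    """Elitist (1+1)-ES: keep the offspring only if it is not worse."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    history = [fx]
    for _ in range(iters):
        # Sample one offspring by Gaussian mutation of the parent.
        y = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
        fy = f(y)
        if fy <= fx:
            x, fx = y, fy
            sigma *= 1.5   # illustrative step-size increase after a success
        else:
            sigma *= 0.9   # illustrative step-size decrease after a failure
        history.append(fx)
    return x, fx, history

x, fx, history = one_plus_one_es(sphere, [3.0, -2.0, 1.0])
```

Because the parent is replaced only by an equal or better offspring, the recorded best value in `history` can never get worse from one iteration to the next.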
SR-FDIL: Synergistic Replay for Federated Domain-Incremental Learning
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 1879-1890. Pub Date: 2024-08-02. DOI: 10.1109/TPDS.2024.3436874
Yichen Li; Wenchao Xu; Yining Qi; Haozhao Wang; Ruixuan Li; Song Guo
Abstract: Federated Learning (FL) allows multiple clients to collaboratively train a model while keeping their data local. However, existing FL approaches typically assume that the data in each client is static and fixed, which cannot account for incremental data with domain shift, leading to catastrophic forgetting on previous domains, particularly when clients are common edge devices that may lack enough storage to retain full samples of each domain. To tackle this challenge, we propose Federated Domain-Incremental Learning via Synergistic Replay (SR-FDIL), which alleviates catastrophic forgetting by coordinating all clients to cache samples and replay them. More specifically, when new data arrives, each client selects the cached samples based not only on their importance in the local dataset but also on their correlation with the global dataset. Moreover, to achieve a balance between learning new data and memorizing old data, we propose a novel client selection mechanism by jointly considering the importance of both old and new data. Extensive experiments on several datasets demonstrate that SR-FDIL outperforms state-of-the-art methods by up to 4.05% in terms of average accuracy over all domains.
Citations: 0
Cost-Effective and Robust Service Provisioning in Multi-Access Edge Computing
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1765-1779. Pub Date: 2024-07-30. DOI: 10.1109/TPDS.2024.3435929
Zhengzhe Xiang; Yuhang Zheng; Dongjing Wang; Javid Taheri; Zengwei Zheng; Minyi Guo
Abstract: With the development of multiaccess edge computing (MEC) technology, an increasing number of researchers and developers are deploying their computation-intensive and IO-intensive services (especially AI services) on edge devices. These devices, being close to end users, provide better performance in mobile environments. By constructing a service provisioning system at the network edge, latency is significantly reduced due to short-distance communication with edge servers. However, since the MEC-based service provisioning system is resource-sensitive and the network may be unstable, careful resource allocation and traffic scheduling strategies are essential. This paper investigates and quantifies the cost-effectiveness and robustness of the MEC-based service provisioning system with the applied resource allocation and traffic scheduling strategies. Based on this analysis, a cost-effective and robust service provisioning algorithm, termed CERA, is proposed to minimize deployment costs while maintaining system robustness. Extensive experiments are conducted to compare the proposed approach with well-known baseline algorithms and evaluate factors impacting the results. The findings demonstrate that CERA achieves at least 15.9% better performance than other baseline algorithms across various instances.
Citations: 0
Privacy Preserving Task Push in Spatial Crowdsourcing With Unknown Popularity
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 2039-2053. Pub Date: 2024-07-29. DOI: 10.1109/TPDS.2024.3434978
Yin Xu; Mingjun Xiao; Jie Wu; He Sun
Abstract: In this paper, we investigate the privacy-preserving task push problem with unknown popularity in Spatial Crowdsourcing (SC), where the platform needs to select some tasks with unknown popularity and push them to workers. Meanwhile, the preferences of workers and the popularity values of tasks might involve some sensitive information, which should be protected from disclosure. To address these concerns, we propose a Privacy Preserving Auction-based Bandit scheme, termed PPAB. Specifically, on the basis of the Combinatorial Multi-armed Bandit (CMAB) game, we first construct a Differentially Private Auction-based CMAB (DPA-CMAB) model. Under the DPA-CMAB model, we design a privacy-preserving arm-pulling policy based on Diffie-Hellman (DH), Differential Privacy (DP), and upper confidence bound, which includes the DH-based encryption mechanism and the hybrid DP-based protection mechanism. The policy not only can learn the popularity of tasks and make online task push decisions, but also can protect the popularity as well as workers’ preferences from being revealed. Meanwhile, we design an auction-based incentive mechanism to determine the payment for each selected task. Furthermore, we conduct an in-depth analysis of the security and online performance of PPAB, and prove that PPAB satisfies some desired properties (i.e., truthfulness, individual rationality, and computational efficiency). Finally, the significant performance of PPAB is confirmed through extensive simulations on the real-world dataset.
Citations: 0
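The abstract above mentions an arm-pulling policy built on an upper confidence bound within a combinatorial bandit. As an illustration only, the sketch below computes a standard UCB1-style index per task and selects the top-k tasks. It omits everything specific to PPAB (Diffie-Hellman encryption, differential privacy, the auction), and the names `ucb_index` and `select_tasks`, the exploration constant, and the sample numbers are assumptions.

```python
import math

def ucb_index(mean_reward, pulls, round_t):
    """UCB1-style index: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return math.inf  # tasks never pushed before are tried first
    return mean_reward + math.sqrt(2.0 * math.log(round_t) / pulls)

def select_tasks(means, pulls, round_t, k):
    """Return the (sorted) indices of the k tasks with the largest index."""
    scores = [ucb_index(m, n, round_t) for m, n in zip(means, pulls)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical popularity estimates for four tasks after 10 rounds.
means = [0.9, 0.5, 0.2, 0.0]
pulls = [5, 4, 1, 0]
chosen = select_tasks(means, pulls, round_t=10, k=2)
```

In this made-up example the two rarely pushed tasks are selected even though their empirical means are lowest, because their exploration bonus dominates.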
A State-of-the-Art Review with Code about Connected Components Labeling on GPUs
IF 5.3, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems. Pub Date: 2024-07-29. DOI: 10.1109/tpds.2024.3434357
Federico Bolelli; Stefano Allegretti; Luca Lumetti; Costantino Grana
Abstract: not listed.
Citations: 0
SSA: A Uniformly Recursive Bidirection-Sequence Systolic Sorter Array
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1721-1734. Pub Date: 2024-07-26. DOI: 10.1109/TPDS.2024.3434332
Teng Gao; Lan Huang; Shang Gao; Kangping Wang
Abstract: The use of reconfigurable circuits with parallel computing capabilities has been explored to enhance sorting performance and reduce power consumption. Nonetheless, most sorting algorithms utilizing dedicated processors are designed solely based on the parallelization of the algorithm, lacking considerations of specialized hardware structures. This leads to problems, including but not limited to the consumption of excessive I/O interface resources, on-chip storage resources, and complex layout wiring. In this paper, we propose a Systolic Sorter Array (SSA), implemented by a Uniform Recurrence Equation (URE) and highly parameterised in terms of data size, bit width and type. Leveraging this uniformly recursive structure, the sorter can simultaneously sort two independent sequences. In addition, we implemented global and local control modes on the FPGA to achieve higher computational frequencies. In our experiments, we have demonstrated the speed-up ratio of SSA relative to other state-of-the-art (SOTA) sorting algorithms using C++ std::sort() as the benchmark. Inheriting the benefits of the Systolic Array architecture, the SSA reaches up to 810 MHz computing frequency on the U200. The results of our study show that SSA outperforms other sorting algorithms in terms of throughput, speed-up ratio, and computation frequency.
Citations: 0
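To give a feel for sorting with a linear array of neighbor compare-exchange cells, the sketch below simulates the classic odd-even transposition sort in software. This is a textbook algorithm used here as a stand-in; it is not the SSA design, its URE formulation, or its two-sequence mode, and the function name is an assumption.

```python
def odd_even_transposition_sort(values):
    """Sort by simulating n rounds of neighbor compare-exchange cells."""
    cells = list(values)
    n = len(cells)
    for step in range(n):
        # Even steps compare pairs (0,1), (2,3), ...; odd steps (1,2), (3,4), ...
        # The pairs within one step are disjoint, so hardware could process
        # them simultaneously; here they are visited in a sequential loop.
        start = step % 2
        for i in range(start, n - 1, 2):
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells

result = odd_even_transposition_sort([7, 3, 9, 1, 4, 8, 2])
```

With n cells, n alternating steps are enough to sort any input, which is why the outer loop runs exactly `n` times.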
Long-Range MD Electrostatics Force Computation on FPGAs
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1690-1707. Pub Date: 2024-07-26. DOI: 10.1109/TPDS.2024.3434347
Sahan Bandara; Anthony Ducimo; Chunshu Wu; Martin Herbordt
Abstract: Strong scaling of long-range electrostatic force computation, which is a central concern of long timescale molecular dynamics simulations, is challenging for CPUs and GPUs due to its complex communication structure and global communication requirements. The scalability challenge is seen especially in small simulations of tens to hundreds of thousands of atoms that are of interest to many important applications such as physics-driven drug discovery. FPGA clusters, with their direct, tightly coupled, low-latency interconnects, are able to address these requirements. For FPGA MD clusters to be effective, however, single device performance must also be competitive. In this work, we leverage the inherent benefits of FPGAs to implement a long-range electrostatic force computation architecture. We present an overall framework with numerous algorithmic, mapping, and architecture innovations, including a unified interleaved memory, a spatial scheduling algorithm, and a design for seamless integration with the larger MD system. We examine a number of alternative configurations based on different resource allocation strategies and user parameters. We show that the best configuration of this architecture, implemented on an Intel Agilex FPGA, can achieve 2124 ns and 287 ns of simulated time per day of wall-clock time for the two molecular dynamics benchmarks DHFR and ApoA1, simulating 23K and 92K particles, respectively.
Citations: 0
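To put the "ns of simulated time per day" figures from the abstract above into per-step terms, the short calculation below converts them into wall-clock time per MD time step. The time step of 2 fs is an assumption made for this example (the abstract does not state one), and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def wall_time_per_step_us(ns_per_day, timestep_fs):
    """Wall-clock microseconds spent per MD time step."""
    steps_per_day = ns_per_day * 1e6 / timestep_fs  # 1 ns = 1e6 fs
    return SECONDS_PER_DAY / steps_per_day * 1e6    # seconds -> microseconds

TIMESTEP_FS = 2.0  # assumed time step, not taken from the abstract

dhfr_us = wall_time_per_step_us(2124, TIMESTEP_FS)   # DHFR, 23K particles
apoa1_us = wall_time_per_step_us(287, TIMESTEP_FS)   # ApoA1, 92K particles
```

Under this assumed time step, the reported rates correspond to roughly 81 µs per step for DHFR and roughly 602 µs per step for ApoA1; a different time step scales both numbers proportionally.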
Redundancy-Free and Load-Balanced TGNN Training With Hierarchical Pipeline Parallelism
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 1904-1919. Pub Date: 2024-07-24. DOI: 10.1109/TPDS.2024.3432855
Yaqi Xia; Zheng Zhang; Donglin Yang; Chuang Hu; Xiaobo Zhou; Hongyang Chen; Qianlong Sang; Dazhao Cheng
Abstract: Recently, Temporal Graph Neural Networks (TGNNs), as an extension of Graph Neural Networks, have demonstrated remarkable effectiveness in handling dynamic graph data. Distributed TGNN training requires efficiently tackling temporal dependency, which often leads to excessive cross-device communication that generates significant redundant data. However, existing systems are unable to remove the redundancy in data reuse and transfer, and suffer from severe communication overhead in a distributed setting. This work introduces Sven, a co-designed algorithm-system library aimed at accelerating TGNN training on a multi-GPU platform. Exploiting dependency patterns of TGNN models, we develop a redundancy-free graph organization to mitigate redundant data transfer. Additionally, we investigate communication imbalance issues among devices and formulate the graph partitioning problem as minimizing the maximum communication balance cost, which is proved to be an NP-hard problem. We propose an approximation algorithm called Re-FlexBiCut to tackle this problem. Furthermore, we incorporate prefetching, adaptive micro-batch pipelining, and asynchronous pipelining to present a hierarchical pipelining mechanism that mitigates the communication overhead. Sven represents the first comprehensive optimization solution for scaling memory-based TGNN training. Through extensive experiments conducted on a 64-GPU cluster, Sven demonstrates impressive speedup, ranging from 1.9x to 3.5x, compared to state-of-the-art approaches. Additionally, Sven achieves up to 5.26x higher communication efficiency and reduces communication imbalance by up to 59.2%.
Citations: 0
IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 9, pp. 1672-1689. Pub Date: 2024-07-23. DOI: 10.1109/TPDS.2024.3432579
Cunyang Wei; Haipeng Jia; Yunquan Zhang; Jianyu Yao; Chendi Li; Wenxuan Cao
Abstract: The matrix multiplication algorithm is a fundamental numerical technique in linear algebra and plays a crucial role in many scientific computing applications. Despite the high performance of mainstream basic linear algebra libraries for large-scale dense matrix multiplications, they exhibit poor performance when applied to matrix multiplication with irregular input. This paper proposes an input-aware tuning framework that accounts for application scenarios and computer architectures to provide high-performance irregular matrix multiplication on ARMv8 and X86 CPUs. The framework comprises two stages: the install-time stage and the run-time stage. The install-time stage utilizes our proposed computational template to generate high-performance kernels for general data layout and SIMD-friendly data layout. The run-time stage utilizes a tiling algorithm suitable for irregular GEMM to select the optimal kernels and link them into an execution plan. Additionally, load-balanced multi-threaded optimization algorithms are defined to exploit the multi-threading capability of modern processors. Experiments demonstrate that the proposed IrGEMM framework can achieve significant performance improvements for irregular GEMM on both ARMv8 and X86 CPUs compared to other mainstream BLAS libraries.
Citations: 0
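The abstract above refers to tiling for irregularly shaped matrix multiplication. The sketch below shows only the basic idea of a tiled (blocked) multiply whose edge tiles are clipped when a dimension is not a multiple of the tile size. It is not IrGEMM's kernel generation, kernel selection, or threading; the function name, the tile size, and the sample matrices are assumptions.

```python
def tiled_matmul(a, b, tile=4):
    """C = A @ B using square tiles; edge tiles are clipped with min()."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile rows of A and C
        for j0 in range(0, n, tile):      # tile columns of B and C
            for p0 in range(0, k, tile):  # tile the shared dimension
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += a_ip * b[p][j]
    return c

# A small tall-and-skinny case: (5 x 3) times (3 x 2), with tile size 2,
# so every dimension ends in a partial tile or a tile boundary.
a = [[float(i + p) for p in range(3)] for i in range(5)]
b = [[float(p * 2 + j) for j in range(2)] for p in range(3)]
c = tiled_matmul(a, b, tile=2)
```

A tuned library would replace the innermost loops with architecture-specific kernels; the clipping with `min()` is the part that handles shapes that do not divide evenly into tiles.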
Sophisticated Orchestrating Concurrent DLRM Training on CPU/GPU Platform
IF 5.6, Zone 2, Computer Science
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 2177-2192. Pub Date: 2024-07-23. DOI: 10.1109/TPDS.2024.3432620
Rui Tian; Jiazhi Jiang; Jiangsu Du; Dan Huang; Yutong Lu
Abstract: Recommendation systems are essential to the operation of the majority of internet services, with Deep Learning Recommendation Models (DLRMs) serving as a crucial component. However, due to distinct computation, data access, and memory usage characteristics of recommendation models, the training of DLRMs may suffer from low resource utilization on prevalent heterogeneous CPU-GPU hardware platforms. Furthermore, as the majority of high-performance computing systems presently depend on multi-GPU computing nodes, the challenge of addressing low resource utilization becomes even more pronounced. Existing concurrent training solutions cannot be straightforwardly applied to DLRM due to various factors, such as insufficient fine-grained memory management and the lack of collaborative CPU-GPU scheduling. In this paper, we introduce RMixer, a scheduling framework that addresses these challenges by providing an efficient job management and scheduling mechanism for DLRM training jobs on heterogeneous CPU-GPU platforms. To facilitate training co-location, we first estimate the peak memory consumption of each job. Additionally, we track and collect resource utilization for DLRM training jobs. Based on the information of computational patterns, a batched job dispatcher with a dynamic resource-complementary scheduling policy is proposed to co-locate DLRM training jobs on the CPU-GPU platform. Scheduling strategies for both intra-GPU and inter-GPU scenarios were meticulously devised, with a focus on thoroughly examining individual GPU resource utilization and achieving a balanced state across multiple GPUs. Experimental results demonstrate that our implementation achieved up to 5.3× and 7.5× higher throughput on a single GPU and four GPUs, respectively, for training jobs involving various recommendation models.
Citations: 0