Parallel Computing: Latest Publications

An automated OpenMP mutation testing framework for performance optimization
IF 2.0, Q4, Computer Science
Parallel Computing Pub Date : 2024-08-21 DOI: 10.1016/j.parco.2024.103097
Dolores Miao , Ignacio Laguna , Giorgis Georgakoudis , Konstantinos Parasyris , Cindy Rubio-González
{"title":"An automated OpenMP mutation testing framework for performance optimization","authors":"Dolores Miao ,&nbsp;Ignacio Laguna ,&nbsp;Giorgis Georgakoudis ,&nbsp;Konstantinos Parasyris ,&nbsp;Cindy Rubio-González","doi":"10.1016/j.parco.2024.103097","DOIUrl":"10.1016/j.parco.2024.103097","url":null,"abstract":"<div><p>Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level thus preventing their portability across compilers. This paper describes <span>Muppet</span>, a new approach that identifies program modifications called <em>mutations</em> aimed at improving program performance. <span>Muppet</span>’s mutations help developers reason about performance defects and missed opportunities to improve performance at the source code level. In contrast to compiler techniques that optimize code at intermediate representations (IR), <span>Muppet</span> uses the idea of source-level <em>mutation testing</em> to relax correctness constraints and automatically discover optimization opportunities that otherwise are not feasible using the IR. We demonstrate the <span>Muppet</span>’s concept in the OpenMP programming model. <span>Muppet</span> generates a list of OpenMP mutations that alter the program parallelism in various ways, and is capable of running a variety of optimization algorithms such as delta debugging, Bayesian Optimization and decision tree optimization to find a subset of mutations which, when applied to the original program, cause the most speedup while maintaining program correctness. When <span>Muppet</span> is evaluated against a diverse set of benchmark programs and proxy applications, it is capable of finding sets of mutations that induce speedup in 75.9% of the evaluated programs.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103097"},"PeriodicalIF":2.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819124000358/pdfft?md5=139743a6196b36bc64bd1733300112aa&pid=1-s2.0-S0167819124000358-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142040335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
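The paper above treats source-level changes to OpenMP directives as "mutations". As a rough illustration of the kind of change such a framework might propose, the sketch below contrasts a baseline loop with a mutated variant; the loop body, the clauses chosen, and the chunk size are illustrative assumptions, not examples taken from Muppet itself.

    #include <cmath>
    #include <vector>

    // Baseline variant: parallelize only the outer loop with a static schedule.
    void scale(std::vector<double>& out, const std::vector<double>& in, int n, int m) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                out[i * m + j] = std::sqrt(in[i * m + j]);
    }

    // One possible mutation: collapse the loop nest and switch to a dynamic
    // schedule. The numerical result must stay identical; only the way the
    // iterations are distributed across threads (and, ideally, the runtime) changes.
    void scale_mutated(std::vector<double>& out, const std::vector<double>& in, int n, int m) {
        #pragma omp parallel for collapse(2) schedule(dynamic, 64)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                out[i * m + j] = std::sqrt(in[i * m + j]);
    }

A framework in this style would generate many such variants, run them, and keep only the combinations that are both correct and faster.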
Abstractions for C++ code optimizations in parallel high-performance applications
IF 2.0, Q4, Computer Science
Parallel Computing Pub Date : 2024-08-14 DOI: 10.1016/j.parco.2024.103096
Jiří Klepl, Adam Šmelko, Lukáš Rozsypal, Martin Kruliš
{"title":"Abstractions for C++ code optimizations in parallel high-performance applications","authors":"Jiří Klepl,&nbsp;Adam Šmelko,&nbsp;Lukáš Rozsypal,&nbsp;Martin Kruliš","doi":"10.1016/j.parco.2024.103096","DOIUrl":"10.1016/j.parco.2024.103096","url":null,"abstract":"<div><p>Many computational problems consider memory throughput a performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features like cache architectures or concurrent memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or changing the order of data structure traversal. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can easily optimize for memory performance and employ semi-automated parallelization or autotuning without altering their internal code. We also add an abstraction for autotuning that allows defining tuning parameters in one place and removes boilerplate code. The proposed solution was implemented as an extension of the Noarr library that simplifies a layout-agnostic design of regular data structures. It is implemented entirely using C<span>++</span> template meta-programming without any nonstandard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC or Intel DPC++. We evaluate the performance and expressiveness of our approach on the Polybench-C benchmarks.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103096"},"PeriodicalIF":2.0,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819124000346/pdfft?md5=9cd8ac7a1eebfc9480655a05bba5ca50&pid=1-s2.0-S0167819124000346-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142012840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
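The "layout-agnostic" idea in the entry above can be illustrated with a minimal sketch in which the mapping from a logical 2D index to a flat memory offset is a separate, swappable policy; this is a generic illustration of the concept, not the actual Noarr API.

    #include <cstddef>
    #include <vector>

    // Two interchangeable layouts mapping a logical 2D index to a flat offset.
    struct RowMajor {
        std::size_t rows, cols;
        std::size_t operator()(std::size_t i, std::size_t j) const { return i * cols + j; }
    };
    struct ColMajor {
        std::size_t rows, cols;
        std::size_t operator()(std::size_t i, std::size_t j) const { return j * rows + i; }
    };

    // The algorithm is written once against an abstract layout; picking a
    // cache-friendlier layout (or a tiled one) never touches this code.
    template <class Layout>
    double sum(const std::vector<double>& data, const Layout& at) {
        double s = 0.0;
        for (std::size_t i = 0; i < at.rows; ++i)
            for (std::size_t j = 0; j < at.cols; ++j)
                s += data[at(i, j)];
        return s;
    }

Libraries such as Noarr push the same separation much further, covering traversal order, parallelization, and (in the extension described above) autotuning parameters, all through C++ template meta-programming.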
Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
IF 2.0, Q4, Computer Science
Parallel Computing Pub Date : 2024-08-10 DOI: 10.1016/j.parco.2024.103095
Gang Xian , Wenxiang Yang , Yusong Tan , Jinghua Feng , Yuqi Li , Jian Zhang , Jie Yu
{"title":"Mobilizing underutilized storage nodes via job path: A job-aware file striping approach","authors":"Gang Xian ,&nbsp;Wenxiang Yang ,&nbsp;Yusong Tan ,&nbsp;Jinghua Feng ,&nbsp;Yuqi Li ,&nbsp;Jian Zhang ,&nbsp;Jie Yu","doi":"10.1016/j.parco.2024.103095","DOIUrl":"10.1016/j.parco.2024.103095","url":null,"abstract":"<div><p>Users’ limited understanding of the storage system architecture prevents them from fully utilizing the parallel I/O capability of the storage system, leading to a negative impact on the overall performance of supercomputers. Therefore, exploring effective strategies for utilizing parallel I/O capabilities is imperative. In this regard, we conduct an analysis of the workload on two production supercomputers’ Object Storage Targets (OSTs) and study the potential inefficient I/O patterns for high performance computing jobs. Our research findings indicate that under the traditional stripe settings that most supercomputers use to ensure stability, the real-time load on OSTs is severely unbalanced. This imbalance results in I/O requests that fail to fully utilize the available OSTs. To tackle this issue, we propose a job-aware optimization approach, which includes static and dynamic file striping. Static file striping optimizes all user jobs, whereas dynamic file striping employs clustering of job names and job paths to extract similarities among jobs and predict partially stripe-optimizable jobs for users. Additionally, a stripe recovery mechanism is employed to mitigate the negative impact of stripe misconfigurations. This approach appropriately adjusts the file stripe layout based on the job’s I/O pattern, allowing for better mobilization of underutilized OSTs to enhance parallel I/O capabilities. Through experimental verification, the number of OSTs that jobs can use has been increased, effectively improving the parallel I/O performance of the job without significantly affecting operational stability.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103095"},"PeriodicalIF":2.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142041036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
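To make "adjusting the file stripe layout" concrete, the toy policy below chooses a Lustre-style stripe count from a predicted per-job I/O volume and the number of available OSTs; the thresholds, the function name, and the policy itself are illustrative assumptions, not the rules used in the paper.

    #include <algorithm>
    #include <cstdint>

    // Toy policy: spread I/O-heavy jobs across more OSTs, keep small jobs on a
    // single stripe to limit metadata overhead. Thresholds are illustrative only.
    int choose_stripe_count(std::uint64_t predicted_bytes, int available_osts) {
        int want = 1;
        if (predicted_bytes > (1ull << 30))  want = 4;   // > 1 GiB
        if (predicted_bytes > (16ull << 30)) want = 16;  // > 16 GiB
        return std::min(want, available_osts);
    }

On a Lustre file system, the chosen value would typically be applied to the job's output directory (for example with lfs setstripe -c <count>) before the job starts writing.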
NxtSPR: A deadlock-free shortest path routing dedicated to relaying for Triplet-Based many-core Architecture
IF 2.0, Q4, Computer Science
Parallel Computing Pub Date : 2024-07-24 DOI: 10.1016/j.parco.2024.103094
Chunfeng Li, Karim Soliman, Fei Yin, Jin Wei, Feng Shi
{"title":"NxtSPR: A deadlock-free shortest path routing dedicated to relaying for Triplet-Based many-core Architecture","authors":"Chunfeng Li,&nbsp;Karim Soliman,&nbsp;Fei Yin,&nbsp;Jin Wei,&nbsp;Feng Shi","doi":"10.1016/j.parco.2024.103094","DOIUrl":"10.1016/j.parco.2024.103094","url":null,"abstract":"<div><p>Deadlock-free routing is a significant challenge in Network-on-Chip (NoC) design as it affects the network’s latency, power consumption, and load balance, impacting the performance of multi-processor systems-on-chip. However, achieving deadlock-free routing will routinely result in expensive overhead as previous solutions either sacrifice performance or power efficiency to proactively avoid deadlocks or impose high hardware complexity to resolve deadlocks when they occur reactively. Utilizing the various characteristics of NoC to implement deadlock-free routing can be significantly more cost-effective with less impact on performance. This paper proposes a relay routing algorithm (NxtSPR) with a shortest path property and a deadlock prevention mechanism based on a synchronized Hamiltonian ring. The proposal is based on an in-depth study of the characteristics of a Triplet-Based many-core Architecture (TriBA) NoC. We establish various important topology-related theories and perform a formal verification (proof-based) for them. By utilizing the critical subgraph and apex of TriBA, NxtSPR can pre-calculate downstream nodes forwarding ports for packets by using a concise judgment strategy. This significantly reduces the computational overhead required for data transmission while optimizing the pipeline design of routers to decrease packet transmission latency and power consumption compared to other TriBA routing algorithms. We group the data transmissions according to the levels of maximum Hamiltonian edges a packet will traverse during its transmission life cycle. Independent data transmissions between groups can avoid mutual interference and resource competition, eliminating potential deadlocks. Gem5 simulation results show that, under the synthetic traffic patterns, compared to the representative (Table) and up-to-date (SPR4T) routing algorithms, NxtSPR achieves a 20.19%, 14.76%, and 5.54%, 4.66% reduction in average packet latency and per-packet power consumption, respectively. Moreover, it has an average of 18.50% and 4.34% improvement in throughput, as compared to them. PARSEC benchmark results show that NxtSPR reduces application runtime by up to a maximum of 22.30% and 12.82% compared to Table and SPR4T, and running the same applications with TriBA results in a maximum runtime reduction of 10.77% compared to 2D-Mesh.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103094"},"PeriodicalIF":2.0,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141785971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-GPU 3D k-nearest neighbors computation with application to ICP, point cloud smoothing and normals computation
IF 2.0, Q4, Computer Science
Parallel Computing Pub Date : 2024-07-02 DOI: 10.1016/j.parco.2024.103093
Alexander Agathos , Philip Azariadis
{"title":"Multi-GPU 3D k-nearest neighbors computation with application to ICP, point cloud smoothing and normals computation","authors":"Alexander Agathos ,&nbsp;Philip Azariadis","doi":"10.1016/j.parco.2024.103093","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103093","url":null,"abstract":"<div><p>The k-Nearest Neighbors algorithm is a fundamental algorithm that finds applications in many fields like Machine Learning, Computer Graphics, Computer Vision, and others. The algorithm determines the closest points (d-dimensional) of a reference set R according to a query set of points Q under a specific metric (Euclidean, Mahalanobis, Manhattan, etc.). This work focuses on the utilization of multiple Graphical Processing Units for the acceleration of the k-Nearest Neighbors algorithm with large or very large sets of 3D points. With the proposed approach the space of the reference set is divided into a 3D grid which is used to facilitate the search for the nearest neighbors. The search in the grid is performed in a multiresolution manner starting from a high-resolution grid and ending up in a coarse one, thus accounting for point clouds that may have non-uniform sampling and/or outliers. Three important algorithms in reverse engineering are revisited and new multi-GPU versions are proposed based on the introduced KNN algorithm. More specifically, the new multi-GPU approach is applied to the Iterative Closest Point algorithm, to the point cloud smoothing, and to the point cloud normal vectors computation and orientation problem. A series of tests and experiments have been conducted and discussed in the paper showing the merits of the proposed multi-GPU approach.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103093"},"PeriodicalIF":2.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141607940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
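The grid-based search mentioned above can be sketched as follows: reference points are hashed into uniform cells, and a query inspects only its own cell and the 26 surrounding cells. The sketch below is a single-resolution, CPU-only, 1-nearest-neighbour simplification meant only to show the data structure; the paper's method is multiresolution and runs across multiple GPUs.

    #include <array>
    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <unordered_map>
    #include <vector>

    using Point = std::array<float, 3>;

    struct UniformGrid {
        float cell = 1.0f;                                        // cell edge length
        const std::vector<Point>* pts = nullptr;
        std::unordered_map<long long, std::vector<std::size_t>> cells;

        static long long key(long long x, long long y, long long z) {
            return x * 73856093LL ^ y * 19349663LL ^ z * 83492791LL;  // spatial hash
        }
        void build(const std::vector<Point>& p, float cell_size) {
            pts = &p; cell = cell_size; cells.clear();
            for (std::size_t i = 0; i < p.size(); ++i)
                cells[key((long long)std::floor(p[i][0] / cell),
                          (long long)std::floor(p[i][1] / cell),
                          (long long)std::floor(p[i][2] / cell))].push_back(i);
        }
        // Nearest reference point to q, scanning the query's cell and its 26
        // neighbours (may miss the true nearest if it lies farther than one cell
        // away; the multiresolution scheme in the paper addresses exactly that).
        std::size_t nearest(const Point& q) const {
            long long cx = (long long)std::floor(q[0] / cell);
            long long cy = (long long)std::floor(q[1] / cell);
            long long cz = (long long)std::floor(q[2] / cell);
            std::size_t best = 0;
            float best_d = std::numeric_limits<float>::max();
            for (long long dx = -1; dx <= 1; ++dx)
                for (long long dy = -1; dy <= 1; ++dy)
                    for (long long dz = -1; dz <= 1; ++dz) {
                        auto it = cells.find(key(cx + dx, cy + dy, cz + dz));
                        if (it == cells.end()) continue;
                        for (std::size_t i : it->second) {
                            const Point& r = (*pts)[i];
                            float d = (r[0]-q[0])*(r[0]-q[0]) + (r[1]-q[1])*(r[1]-q[1]) + (r[2]-q[2])*(r[2]-q[2]);
                            if (d < best_d) { best_d = d; best = i; }
                        }
                    }
            return best;
        }
    };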
WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel
IF 2.0, Q4, Computer Science
Parallel Computing Pub Date : 2024-06-29 DOI: 10.1016/j.parco.2024.103092
Duo Yang , Bing Hu , An Liu , A-Long Jin , Kwan L. Yeung , Yang You
{"title":"WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel","authors":"Duo Yang ,&nbsp;Bing Hu ,&nbsp;An Liu ,&nbsp;A-Long Jin ,&nbsp;Kwan L. Yeung ,&nbsp;Yang You","doi":"10.1016/j.parco.2024.103092","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103092","url":null,"abstract":"<div><p>Parameter server is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is only updated when the slowest worker uploads the gradients, ensuring the consistency of global models that are pulled down by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version to further reduce the communication overhead. It enables parallel execution of communication and computation tasks on workers to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism can reduce the required time to achieve the target accuracy by up to 60% compared with the fastest method and increase the proportion of computation time from 55%–72% in existing methods to 91%.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103092"},"PeriodicalIF":2.0,"publicationDate":"2024-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141607939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
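The synchronization rule described above (fast workers keep uploading gradients from extra local steps, while the global model advances only once the slowest worker has reported) can be written down as a small server-side sketch; the class layout, field names, and plain SGD update are illustrative assumptions, not the authors' implementation.

    #include <cstddef>
    #include <vector>

    // Server-side sketch of a WBSP-style rule: gradients are accumulated as they
    // arrive, but the global model is updated, and a new version published, only
    // after every worker (including the slowest) has contributed this round.
    struct Server {
        std::vector<double> model, pending;
        std::vector<bool> reported;
        double lr = 0.01;

        Server(std::size_t dim, std::size_t workers)
            : model(dim, 0.0), pending(dim, 0.0), reported(workers, false) {}

        // Called whenever any worker uploads a gradient; a fast worker may call
        // this several times per round (one call per extra local step).
        void upload(std::size_t worker, const std::vector<double>& grad) {
            for (std::size_t i = 0; i < model.size(); ++i) pending[i] += grad[i];
            reported[worker] = true;
            if (all_reported()) apply_round();
        }

        bool all_reported() const {
            for (bool r : reported) if (!r) return false;
            return true;
        }
        void apply_round() {
            for (std::size_t i = 0; i < model.size(); ++i) {
                model[i] -= lr * pending[i];
                pending[i] = 0.0;
            }
            reported.assign(reported.size(), false);   // start the next round
        }
    };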
Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer
IF 1.4, Q4, Computer Science
Parallel Computing Pub Date : 2024-05-04 DOI: 10.1016/j.parco.2024.103085
Qingcai Jiang , Zhenwei Cao , Xinhui Cui , Lingyun Wan , Xinming Qin , Huanqi Cao , Hong An , Junshi Chen , Jie Liu , Wei Hu , Jinlong Yang
{"title":"Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer","authors":"Qingcai Jiang ,&nbsp;Zhenwei Cao ,&nbsp;Xinhui Cui ,&nbsp;Lingyun Wan ,&nbsp;Xinming Qin ,&nbsp;Huanqi Cao ,&nbsp;Hong An ,&nbsp;Junshi Chen ,&nbsp;Jie Liu ,&nbsp;Wei Hu ,&nbsp;Jinlong Yang","doi":"10.1016/j.parco.2024.103085","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103085","url":null,"abstract":"<div><p>First-principles time-dependent density functional theory (TDDFT) is a powerful tool to accurately describe the excited-state properties of molecules and solids in condensed matter physics, computational chemistry, and materials science. However, a perceived drawback in TDDFT calculations is its ultrahigh computational cost <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>5</mn></mrow></msup><mo>∼</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>6</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> and large memory usage <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>4</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> especially for plane-wave basis set, confining its applications to large systems containing thousands of atoms. Here, we present a massively parallel implementation of linear-response TDDFT (LR-TDDFT) and accelerate LR-TDDFT in two different aspects: (1) numerical algorithms on the X86 supercomputer and (2) optimizations on the heterogeneous architecture of the new Sunway supercomputer. Furthermore, we carefully design the parallel data and task distribution schemes to accommodate the physical nature of different computation steps. By utilizing these two different methods, our implementation can gain an overall speedup of 10x and 80x and efficiently scales to large systems up to 4096 and 2744 atoms within dozens of seconds.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103085"},"PeriodicalIF":1.4,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140894775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA
IF 1.4, Q4, Computer Science
Parallel Computing Pub Date : 2024-03-26 DOI: 10.1016/j.parco.2024.103084
Fahimeh Yazdanpanah, Mohammad Alaei
{"title":"An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA","authors":"Fahimeh Yazdanpanah,&nbsp;Mohammad Alaei","doi":"10.1016/j.parco.2024.103084","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103084","url":null,"abstract":"<div><p>PSO (particle swarm optimization), is an intelligent search method for finding the best solution according to population state. Various parallel implementations of this algorithm have been presented for intensive-computing applications. The ALC-PSO algorithm (PSO with an aging leader and challengers) is an improved population-based procedure that increases convergence rapidity, compared to the traditional PSO. In this paper, we propose a low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA, for execution on both CPU and GPU cores. This is the first effort to heterogeneous parallel implementing ALC-PSO algorithm with combination of OmpSs and CUDA. This hybrid parallel programming approach increases the performance and efficiency of the intensive-computing applications. The proposed approach of this article is also useful and applicable for heterogeneous parallel execution of the other improved versions of PSO algorithm, on both CPUs and GPUs. The results demonstrate that the proposed approach provides higher performance, in terms of delay and power consumption, than the existence implementations of ALC-PSO algorithm.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103084"},"PeriodicalIF":1.4,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140327837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
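For readers unfamiliar with the computation being parallelized, one iteration of the standard PSO update (which ALC-PSO extends with an aging leader and challengers) is sketched below as a per-particle OpenMP loop; the OmpSs task annotations and CUDA offloading used in the paper are not shown, and the data layout here is an assumption.

    #include <cstddef>
    #include <random>
    #include <vector>

    struct Swarm {
        std::vector<double> x, v, pbest;   // flattened: n particles * dims values
        std::vector<double> gbest;         // best position found by the leader
        std::size_t n = 0, dims = 0;
    };

    // One PSO iteration: each particle's velocity/position update is independent,
    // which is what makes the per-particle loop a natural target for OpenMP
    // threads on the CPU or a kernel on the GPU.
    void step(Swarm& s, double w, double c1, double c2, unsigned seed) {
        #pragma omp parallel for
        for (long long p = 0; p < (long long)s.n; ++p) {
            std::mt19937 rng(seed + (unsigned)p);          // per-particle RNG stream
            std::uniform_real_distribution<double> u(0.0, 1.0);
            for (std::size_t d = 0; d < s.dims; ++d) {
                std::size_t i = (std::size_t)p * s.dims + d;
                s.v[i] = w * s.v[i]
                       + c1 * u(rng) * (s.pbest[i] - s.x[i])
                       + c2 * u(rng) * (s.gbest[d] - s.x[i]);
                s.x[i] += s.v[i];
            }
        }
    }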
Federated learning based modulation classification for multipath channels
IF 1.4, Q4, Computer Science
Parallel Computing Pub Date : 2024-03-16 DOI: 10.1016/j.parco.2024.103083
Sanjay Bhardwaj, Da-Hye Kim, Dong-Seong Kim
{"title":"Federated learning based modulation classification for multipath channels","authors":"Sanjay Bhardwaj,&nbsp;Da-Hye Kim,&nbsp;Dong-Seong Kim","doi":"10.1016/j.parco.2024.103083","DOIUrl":"10.1016/j.parco.2024.103083","url":null,"abstract":"<div><p>Deep learning (DL)-based automatic modulation classification (AMC) is a primary research field for identifying modulation types. However, traditional DL-based AMC approaches rely on hand-crafted features, which can be time-consuming and may not capture all relevant information in the signal. Additionally, they are centralized solutions that are trained on large amounts of data acquired from local clients and stored on a server, leading to weak performance in terms of correct classification probability. To address these issues, a federated learning (FL)-based AMC approach is proposed, called FL-MP-CNN-AMC, which takes into account the effects of multipath channels (reflected and scattered paths) and considers the use of a modified loss function for solving the class imbalance problem caused by these channels. In addition, hyperparameter tuning and optimization of the loss function are discussed and analyzed to improve the performance of the proposed approach. The classification performance is investigated by considering the effects of interference level, delay spread, scattered and reflected paths, phase offset, and frequency offset. The simulation results show that the proposed approach provides excellent performance in terms of correct classification probability, confusion matrix, convergence and communication overhead when compared to contemporary methods.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103083"},"PeriodicalIF":1.4,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140171507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
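The entry above mentions a modified loss function for class imbalance without reproducing it; as a point of reference, one common imbalance-aware choice is a class-weighted cross-entropy, shown below. This is a standard textbook form and not necessarily the specific modification used in FL-MP-CNN-AMC.

    L = -\frac{1}{N} \sum_{i=1}^{N} w_{y_i} \, \log p_{\theta}(y_i \mid x_i),
    \qquad w_c = \frac{N}{C \, n_c}

Here n_c is the number of training examples of modulation class c and C is the number of classes, so rarer classes receive proportionally larger weights in the loss.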
PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters
IF 1.4, Q4, Computer Science
Parallel Computing Pub Date : 2024-03-12 DOI: 10.1016/j.parco.2024.103082
Kaihao Ma , Zhenkun Cai , Xiao Yan , Yang Zhang , Zhi Liu , Yihui Feng , Chao Li , Wei Lin , James Cheng
{"title":"PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters","authors":"Kaihao Ma ,&nbsp;Zhenkun Cai ,&nbsp;Xiao Yan ,&nbsp;Yang Zhang ,&nbsp;Zhi Liu ,&nbsp;Yihui Feng ,&nbsp;Chao Li ,&nbsp;Wei Lin ,&nbsp;James Cheng","doi":"10.1016/j.parco.2024.103082","DOIUrl":"10.1016/j.parco.2024.103082","url":null,"abstract":"<div><p>Multi-tenant GPU clusters are common, where users purchase GPU quota to run their neural network training jobs. However, strict quota-based scheduling often leads to cluster under-utilization, while allowing quota groups to use excess GPUs improves utilization but results in fairness problems. We propose PPS, a probabilistic prediction based scheduler, which uses job history statistics to predict future cluster status for making good scheduling decisions. Different from existing schedulers that rely on deep learning frameworks to adjust bad scheduling decisions and/or require detailed job information, PPS treats jobs as black boxes in that PPS runs a job to completion without adjustment once scheduled and requires only aggregate job statistics. The black-box feature is favorable due to its good generality, compatibility and security, and made possible by the predictability of aggregate resource utilization statistics of large clusters. Extensive experiments show that PPS achieves high cluster utilization and good fairness simultaneously.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103082"},"PeriodicalIF":1.4,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140275754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0