Parallel Computing最新文献_第3页

WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel WBSP：利用工作繁忙同步并行技术解决分布式机器学习中的落后问题

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-06-29 DOI: 10.1016/j.parco.2024.103092

Duo Yang , Bing Hu , An Liu , A-Long Jin , Kwan L. Yeung , Yang You

{"title":"WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel","authors":"Duo Yang , Bing Hu , An Liu , A-Long Jin , Kwan L. Yeung , Yang You","doi":"10.1016/j.parco.2024.103092","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103092","url":null,"abstract":"<div><p>Parameter server is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is only updated when the slowest worker uploads the gradients, ensuring the consistency of global models that are pulled down by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version to further reduce the communication overhead. It enables parallel execution of communication and computation tasks on workers to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism can reduce the required time to achieve the target accuracy by up to 60% compared with the fastest method and increase the proportion of computation time from 55%–72% in existing methods to 91%.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103092"},"PeriodicalIF":2.0,"publicationDate":"2024-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141607939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer 用两种不同方法扩展 LR-TDDFT 的极限：数值算法和新型 Sunway 异构超级计算机

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-05-04 DOI: 10.1016/j.parco.2024.103085

Qingcai Jiang , Zhenwei Cao , Xinhui Cui , Lingyun Wan , Xinming Qin , Huanqi Cao , Hong An , Junshi Chen , Jie Liu , Wei Hu , Jinlong Yang

{"title":"Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer","authors":"Qingcai Jiang , Zhenwei Cao , Xinhui Cui , Lingyun Wan , Xinming Qin , Huanqi Cao , Hong An , Junshi Chen , Jie Liu , Wei Hu , Jinlong Yang","doi":"10.1016/j.parco.2024.103085","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103085","url":null,"abstract":"<div><p>First-principles time-dependent density functional theory (TDDFT) is a powerful tool to accurately describe the excited-state properties of molecules and solids in condensed matter physics, computational chemistry, and materials science. However, a perceived drawback in TDDFT calculations is its ultrahigh computational cost <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>5</mn></mrow></msup><mo>∼</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>6</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> and large memory usage <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>4</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> especially for plane-wave basis set, confining its applications to large systems containing thousands of atoms. Here, we present a massively parallel implementation of linear-response TDDFT (LR-TDDFT) and accelerate LR-TDDFT in two different aspects: (1) numerical algorithms on the X86 supercomputer and (2) optimizations on the heterogeneous architecture of the new Sunway supercomputer. Furthermore, we carefully design the parallel data and task distribution schemes to accommodate the physical nature of different computation steps. By utilizing these two different methods, our implementation can gain an overall speedup of 10x and 80x and efficiently scales to large systems up to 4096 and 2744 atoms within dozens of seconds.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103085"},"PeriodicalIF":1.4,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140894775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA 利用 OmpSs 和 CUDA 实现 ALC-PSO 算法的低功耗异构并行计算方法

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-03-26 DOI: 10.1016/j.parco.2024.103084

Fahimeh Yazdanpanah, Mohammad Alaei

{"title":"An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA","authors":"Fahimeh Yazdanpanah, Mohammad Alaei","doi":"10.1016/j.parco.2024.103084","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103084","url":null,"abstract":"<div><p>PSO (particle swarm optimization), is an intelligent search method for finding the best solution according to population state. Various parallel implementations of this algorithm have been presented for intensive-computing applications. The ALC-PSO algorithm (PSO with an aging leader and challengers) is an improved population-based procedure that increases convergence rapidity, compared to the traditional PSO. In this paper, we propose a low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA, for execution on both CPU and GPU cores. This is the first effort to heterogeneous parallel implementing ALC-PSO algorithm with combination of OmpSs and CUDA. This hybrid parallel programming approach increases the performance and efficiency of the intensive-computing applications. The proposed approach of this article is also useful and applicable for heterogeneous parallel execution of the other improved versions of PSO algorithm, on both CPUs and GPUs. The results demonstrate that the proposed approach provides higher performance, in terms of delay and power consumption, than the existence implementations of ALC-PSO algorithm.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103084"},"PeriodicalIF":1.4,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140327837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Federated learning based modulation classification for multipath channels 基于联合学习的多径信道调制分类

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-03-16 DOI: 10.1016/j.parco.2024.103083

Sanjay Bhardwaj, Da-Hye Kim, Dong-Seong Kim

{"title":"Federated learning based modulation classification for multipath channels","authors":"Sanjay Bhardwaj, Da-Hye Kim, Dong-Seong Kim","doi":"10.1016/j.parco.2024.103083","DOIUrl":"10.1016/j.parco.2024.103083","url":null,"abstract":"<div><p>Deep learning (DL)-based automatic modulation classification (AMC) is a primary research field for identifying modulation types. However, traditional DL-based AMC approaches rely on hand-crafted features, which can be time-consuming and may not capture all relevant information in the signal. Additionally, they are centralized solutions that are trained on large amounts of data acquired from local clients and stored on a server, leading to weak performance in terms of correct classification probability. To address these issues, a federated learning (FL)-based AMC approach is proposed, called FL-MP-CNN-AMC, which takes into account the effects of multipath channels (reflected and scattered paths) and considers the use of a modified loss function for solving the class imbalance problem caused by these channels. In addition, hyperparameter tuning and optimization of the loss function are discussed and analyzed to improve the performance of the proposed approach. The classification performance is investigated by considering the effects of interference level, delay spread, scattered and reflected paths, phase offset, and frequency offset. The simulation results show that the proposed approach provides excellent performance in terms of correct classification probability, confusion matrix, convergence and communication overhead when compared to contemporary methods.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103083"},"PeriodicalIF":1.4,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140171507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters PPS：多租户 GPU 集群的公平高效黑盒调度

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-03-12 DOI: 10.1016/j.parco.2024.103082

Kaihao Ma , Zhenkun Cai , Xiao Yan , Yang Zhang , Zhi Liu , Yihui Feng , Chao Li , Wei Lin , James Cheng

{"title":"PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters","authors":"Kaihao Ma , Zhenkun Cai , Xiao Yan , Yang Zhang , Zhi Liu , Yihui Feng , Chao Li , Wei Lin , James Cheng","doi":"10.1016/j.parco.2024.103082","DOIUrl":"10.1016/j.parco.2024.103082","url":null,"abstract":"<div><p>Multi-tenant GPU clusters are common, where users purchase GPU quota to run their neural network training jobs. However, strict quota-based scheduling often leads to cluster under-utilization, while allowing quota groups to use excess GPUs improves utilization but results in fairness problems. We propose PPS, a probabilistic prediction based scheduler, which uses job history statistics to predict future cluster status for making good scheduling decisions. Different from existing schedulers that rely on deep learning frameworks to adjust bad scheduling decisions and/or require detailed job information, PPS treats jobs as black boxes in that PPS runs a job to completion without adjustment once scheduled and requires only aggregate job statistics. The black-box feature is favorable due to its good generality, compatibility and security, and made possible by the predictability of aggregate resource utilization statistics of large clusters. Extensive experiments show that PPS achieves high cluster utilization and good fairness simultaneously.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103082"},"PeriodicalIF":1.4,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140275754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Analyzing the impact of CUDA versions on GPU applications 分析 CUDA 版本对 GPU 应用程序的影响

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-02-29 DOI: 10.1016/j.parco.2024.103081

Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Hiroki Honda

{"title":"Analyzing the impact of CUDA versions on GPU applications","authors":"Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Hiroki Honda","doi":"10.1016/j.parco.2024.103081","DOIUrl":"10.1016/j.parco.2024.103081","url":null,"abstract":"<div><p>CUDA toolkits are widely used to develop applications running on NVIDIA GPUs. They include compilers and are frequently updated to integrate state-of-the-art compilation techniques. Hence, many HPC users believe that the latest CUDA toolkit will improve application performance; however, considering results from CPU compilers, there are cases where this is not true. In this paper, we thoroughly evaluate the impact of CUDA toolkit version on the performance, power consumption, and energy consumption of GPU applications with four GPU architectures. Our results show that though the latest CUDA toolkit obtains the best performance, power consumption, and energy consumption for many applications in most cases, but we found a few exceptions. For such applications, we conducted an in-depth analysis using the SASS to identify why older CUDA toolkit achieve performance improvement. Our analysis showed that the factors that caused them are by three phenomena: aggressive loop unrolling, inefficient instruction scheduling, and the impact of host compilers.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103081"},"PeriodicalIF":1.4,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S016781912400019X/pdfft?md5=62bfbd6666b978a441b0d0daa8420592&pid=1-s2.0-S016781912400019X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture 在新一代 Sunway 架构上并行优化和应用非结构化稀疏三角求解器

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-02-28 DOI: 10.1016/j.parco.2024.103080

Jianjiang Li , Lin Li , Qingwei Wang , Wei Xue , Jiabi Liang , Jinliang Shi

{"title":"Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture","authors":"Jianjiang Li , Lin Li , Qingwei Wang , Wei Xue , Jiabi Liang , Jinliang Shi","doi":"10.1016/j.parco.2024.103080","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103080","url":null,"abstract":"<div><p>Large-scale sparse linear equation solver plays an important role in both numerical simulation and artificial intelligence, and sparse triangular equation solver is a key step in solving sparse linear equation systems. Its parallel optimization can effectively improve the efficiency of solving sparse linear equation systems. In this paper, we design and implement a parallel algorithm for solving sparse triangular equations in combination with the features of the new generation of Sunway architecture, and optimize the access and communication respectively for 949 real equations and 32 complex equations in the SuiteSparse collection. The solution efficiency of the algorithm presented in this paper outperforms the cuSparse algorithm on NVIDIA V100 GPU platforms in more than 71% of the cases, and the speedup is even better in solving larger cases (matrix size greater than 10,000): our method increases the speedup from 1.29 time of the previous version to an average speedup of 5.54 and the best speedup of 32.18 over the sequential method on the next generation of Sunway architecture when using 64 slave cores.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103080"},"PeriodicalIF":1.4,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140024112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Editorial for parallel computing 并行计算》编辑部

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-02-01 DOI: 10.1016/j.parco.2024.103065

Anne Benoit

引用次数: 0

Integrating FPGA-based hardware acceleration with relational databases 将基于 FPGA 的硬件加速与关系数据库相结合

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-02-01 DOI: 10.1016/j.parco.2024.103064

Ke Liu , Haonan Tong , Zhongxiang Sun, Zhixin Ren, Guangkui Huang, Hongyin Zhu, Luyang Liu, Qunyang Lin, Chuang Zhang

{"title":"Integrating FPGA-based hardware acceleration with relational databases","authors":"Ke Liu , Haonan Tong , Zhongxiang Sun, Zhixin Ren, Guangkui Huang, Hongyin Zhu, Luyang Liu, Qunyang Lin, Chuang Zhang","doi":"10.1016/j.parco.2024.103064","DOIUrl":"10.1016/j.parco.2024.103064","url":null,"abstract":"<div><p>The explosion of data over the last decades puts significant strain on the computational capacity of the central processing unit (CPU), challenging online analytical processing (OLAP). While previous studies have shown the potential of using Field Programmable Gate Arrays (FPGAs) in database systems, integrating FPGA-based hardware acceleration with relational databases remains challenging because of the complex nature of relational database operations and the need for specialized FPGA programming skills. Additionally, there are significant challenges related to optimizing FPGA-based acceleration for specific database workloads, ensuring data consistency and reliability, and integrating FPGA-based hardware acceleration with existing database infrastructure. In this study, we proposed a novel end-to-end FPGA-based acceleration system that supports native SQL statements and storage engine. We defined a callback process to reload the database query logic and customize the scanning method for database queries. Through middleware process development, we optimized offloading efficiency on PCIe bus by scheduling data transmission and computation in a pipeline workflow. Additionally, we designed a novel five-stage FPGA microarchitecture module that achieves optimal clock frequency, further enhancing offloading efficiency. Results from systematic evaluations indicate that our solution allows a single FPGA card to perform as well as 8 CPU query processes, while reducing CPU load by 34%. Compared to using 4 CPU cores, our FPGA-based acceleration system reduces query latency by 1.7 times without increasing CPU load. Furthermore, our proposed solution achieves 2.1 times computation speedup for data filtering compared with the software baseline in a single core environment. Overall, our work presents a valuable end-to-end hardware acceleration system for OLAP databases.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"119 ","pages":"Article 103064"},"PeriodicalIF":1.4,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819124000024/pdfft?md5=d270aeec859768a5bff3f5d4988863f9&pid=1-s2.0-S0167819124000024-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139825507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Fast data-dependence profiling through prior static analysis 通过先期静态分析快速剖析数据依赖性

IF 1.4 4区计算机科学

Parallel Computing Pub Date : 2024-01-11 DOI: 10.1016/j.parco.2024.103063

Mohammad Norouzi , Nicolas Morew , Qamar Ilias , Lukas Rothenberger , Ali Jannesari , Felix Wolf

{"title":"Fast data-dependence profiling through prior static analysis","authors":"Mohammad Norouzi , Nicolas Morew , Qamar Ilias , Lukas Rothenberger , Ali Jannesari , Felix Wolf","doi":"10.1016/j.parco.2024.103063","DOIUrl":"10.1016/j.parco.2024.103063","url":null,"abstract":"<div><p>Data-dependence profiling is a program-analysis technique for detecting parallelism opportunities in sequential programs. It captures data dependences that actually occur during program execution, filtering parallelism-preventing dependences that purely static methods assume only because they lack critical runtime information, such as the values of pointers and array indices. Profiling, however, suffers from high runtime overhead. In our earlier work, we accelerated data-dependence profiling by excluding polyhedral loops that can be handled statically using certain compilers and eliminating scalar variables that create statically-identifiable data dependences. In this paper, we combine the two methods and integrate them into DiscoPoP, a data-dependence profiler and parallelism discovery tool. Additionally, we detect reduction patterns statically and unify the three static analyses with the DiscoPoP framework to significantly diminish the profiling overhead and for a wider range of programs. We have evaluated our unified approaches with 49 benchmarks from three benchmark suites and two computer simulation applications. The evaluation results show that our approach reports fewer false positive and negative data dependences than the original data-dependence profiler and reduces the profiling time by at least 43%, with a median reduction of 76% across all programs. Also, we identify 40% of reduction cases statically and eliminate the associated profiling overhead for these cases.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"119 ","pages":"Article 103063"},"PeriodicalIF":1.4,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819124000012/pdfft?md5=99e5ae1bcda1fac5d3c65fb23d0ba7f8&pid=1-s2.0-S0167819124000012-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139461002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0