Parallel Computing最新文献_第2页

Lowering entry barriers to developing custom simulators of distributed applications and platforms with SimGrid 降低使用SimGrid开发分布式应用和平台的定制模拟器的入门门槛

IF 2 4区计算机科学

Parallel Computing Pub Date : 2025-01-20 DOI: 10.1016/j.parco.2025.103125

Henri Casanova , Arnaud Giersch , Arnaud Legrand , Martin Quinson , Frédéric Suter

{"title":"Lowering entry barriers to developing custom simulators of distributed applications and platforms with SimGrid","authors":"Henri Casanova , Arnaud Giersch , Arnaud Legrand , Martin Quinson , Frédéric Suter","doi":"10.1016/j.parco.2025.103125","DOIUrl":"10.1016/j.parco.2025.103125","url":null,"abstract":"<div><div>Researchers in parallel and distributed computing (PDC) often resort to simulation because experiments conducted using a simulator can be for arbitrary experimental scenarios, are less resource-, labor-, and time-consuming than their real-world counterparts, and are perfectly repeatable and observable. Many frameworks have been developed to ease the development of PDC simulators, and these frameworks provide different levels of accuracy, scalability, versatility, extensibility, and usability. The SimGrid framework has been used by many PDC researchers to produce a wide range of simulators for over two decades. Its popularity is due to a large emphasis placed on accuracy, scalability, and versatility, and is in spite of shortcomings in terms of extensibility and usability. Although SimGrid provides sensible simulation models for the common case, it was difficult for users to extend these models to meet domain-specific needs. Furthermore, SimGrid only provided relatively low-level simulation abstractions, making the implementation of a simulator of a complex system a labor-intensive undertaking. In this work we describe developments in the last decade that have contributed to vastly improving extensibility and usability, thus lowering or removing entry barriers for users to develop custom SimGrid simulators.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103125"},"PeriodicalIF":2.0,"publicationDate":"2025-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143176246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Scalable tasking runtime with parallelized builders for explicit message passing architectures 具有用于显式消息传递体系结构的并行构建器的可伸缩任务运行时

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-12-20 DOI: 10.1016/j.parco.2024.103124

Xiran Gao , Li Chen , Haoyu Wang , Huimin Cui , Xiaobing Feng

{"title":"Scalable tasking runtime with parallelized builders for explicit message passing architectures","authors":"Xiran Gao , Li Chen , Haoyu Wang , Huimin Cui , Xiaobing Feng","doi":"10.1016/j.parco.2024.103124","DOIUrl":"10.1016/j.parco.2024.103124","url":null,"abstract":"<div><div>The sequential task flow (STF) model introduces implicit data dependences to exploit task-based parallelism, simplifying programming but also introducing non-negligible runtime overhead. On emerging cache-less, explicit inter-core message passing (EMP) architectures, the long latency of memory access further amplifies the runtime overhead of the traditional STF model, resulting in unsatisfactory performance.</div><div>This paper addresses two main components in the STF tasking runtime. We uncover abundant concurrency in the task dependence graph (TDG) building process through three sufficient conditions, put forward PBH, a parallelized TDG building algorithm with helpers which mixes pipeline parallelism and data parallelism to overcome the TDG building bottleneck for fine-grained tasks. We also introduce a centralized, lock-less task scheduler, EMP-C, based on the EMP interface, and propose three optimizations. These two techniques are implemented and evaluated on a product processor with EMP support, i.e. SW26010. Experimental results show that compared to traditional techniques, PBH achieves an average speedup of 1.55 for fine-grained task workloads, and the EMP-C scheduler brings speedups as high as 1.52 and 2.38 for fine-grained and coarse-grained task workloads, respectively. And the combination of these two techniques significantly improves the granularity scalability of the runtime, reducing the minimum effective task granularity (METG) to 0.1 ms and achieving an order of magnitude decrease in some cases.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103124"},"PeriodicalIF":2.0,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143176245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Iterative methods in GPU-resident linear solvers for nonlinear constrained optimization 非线性约束优化的gpu驻留线性求解迭代方法

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-12-06 DOI: 10.1016/j.parco.2024.103123

Kasia Świrydowicz , Nicholson Koukpaizan , Maksudul Alam , Shaked Regev , Michael Saunders , Slaven Peleš

{"title":"Iterative methods in GPU-resident linear solvers for nonlinear constrained optimization","authors":"Kasia Świrydowicz , Nicholson Koukpaizan , Maksudul Alam , Shaked Regev , Michael Saunders , Slaven Peleš","doi":"10.1016/j.parco.2024.103123","DOIUrl":"10.1016/j.parco.2024.103123","url":null,"abstract":"<div><div>Linear solvers are major computational bottlenecks in a wide range of decision support and optimization computations. The challenges become even more pronounced on heterogeneous hardware, where traditional sparse numerical linear algebra methods are often inefficient. For example, methods for solving ill-conditioned linear systems have relied on conditional branching, which degrades performance on hardware accelerators such as graphical processing units (GPUs). To improve the efficiency of solving ill-conditioned systems, our computational strategy separates computations that are efficient on GPUs from those that need to run on traditional central processing units (CPUs). Our strategy maximizes the reuse of expensive CPU computations. Iterative methods, which thus far have not been broadly used for ill-conditioned linear systems, play an important role in our approach. In particular, we extend ideas from Arioli et al., (2007) to implement iterative refinement using inexact LU factors and flexible generalized minimal residual (FGMRES), with the aim of efficient performance on GPUs. We focus on solutions that are effective within broader application contexts, and discuss how early performance tests could be improved to be more predictive of the performance in a realistic environment.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103123"},"PeriodicalIF":2.0,"publicationDate":"2024-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143175823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards resilient and energy efficient scalable Krylov solvers 实现有弹性和高能效的可扩展克雷洛夫求解器

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-11-13 DOI: 10.1016/j.parco.2024.103122

Zheng Miao , Jon C. Calhoun , Rong Ge

{"title":"Towards resilient and energy efficient scalable Krylov solvers","authors":"Zheng Miao , Jon C. Calhoun , Rong Ge","doi":"10.1016/j.parco.2024.103122","DOIUrl":"10.1016/j.parco.2024.103122","url":null,"abstract":"<div><div>Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques including checkpoint-restart and forward recovery. We focus on sparse linear solvers as they are the fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into the forward recovery that recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that while the time and energy costs of various resilience techniques are different, they share the common components and can be quantitatively evaluated with a generalized framework. This analysis framework can be used to guide the design of performance and energy optimization technologies. While each resilience technique has its advantages depending on the fault rate, system size, and power budget, the forward recovery can further benefit from matrix-aware optimizations for large-scale computing.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103122"},"PeriodicalIF":2.0,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142703732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Seesaw: A 4096-bit vector processor for accelerating Kyber based on RISC-V ISA extensions Seesaw：基于 RISC-V ISA 扩展的用于加速 Kyber 的 4096 位矢量处理器

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-11-08 DOI: 10.1016/j.parco.2024.103121

Xiaofeng Zou , Yuanxi Peng , Tuo Li , Lingjun Kong , Lu Zhang

{"title":"Seesaw: A 4096-bit vector processor for accelerating Kyber based on RISC-V ISA extensions","authors":"Xiaofeng Zou , Yuanxi Peng , Tuo Li , Lingjun Kong , Lu Zhang","doi":"10.1016/j.parco.2024.103121","DOIUrl":"10.1016/j.parco.2024.103121","url":null,"abstract":"<div><div>The ML-KEM standard based on Kyber algorithm is one of the post-quantum cryptography (PQC) standards released by the National Institute of Standards and Technology (NIST) to withstand quantum attacks. To increase throughput and reduce the execution time that is limited by the high computational complexity of the Kyber algorithm, an RISC-V-based processor Seesaw is designed to accelerate the Kyber algorithm. The 32 specialized extension instructions are mainly designed to enhance the parallel computing ability of the processor and accelerate all the processes of the Kyber algorithm by thoroughly analyzing its characteristics. Subsequently, by carefully designing hardware such as poly vector registers and algorithm execution units on the RISC-V processor, the support of microarchitecture for extension instructions was achieved. Seesaw supports 4096-bit vector calculations through its poly vector registers and execution unit to meet high-throughput requirements and is implemented on the field-programmable gate array (FPGA). In addition, we modify the compiler simultaneously to adapt to the instruction extension and execution of Seesaw. Experimental results indicate that the processor achieves a speed-up of 432<span><math><mo>×</mo></math></span> and 18864<span><math><mo>×</mo></math></span> for hash and NTT, respectively, compared with that without extension instructions and a speed-up of 5.6<span><math><mo>×</mo></math></span> for the execution of the Kyber algorithm compared with the advanced hardware design.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103121"},"PeriodicalIF":2.0,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning FastPTM：快速加载预训练模型的权重以提供并行推理服务

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-10-10 DOI: 10.1016/j.parco.2024.103114

Fenglong Cai , Dong Yuan , Zhe Yang , Yonghui Xu , Wei He , Wei Guo , Lizhen Cui

{"title":"FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning","authors":"Fenglong Cai , Dong Yuan , Zhe Yang , Yonghui Xu , Wei He , Wei Guo , Lizhen Cui","doi":"10.1016/j.parco.2024.103114","DOIUrl":"10.1016/j.parco.2024.103114","url":null,"abstract":"<div><div>Pre-trained models (PTMs) have demonstrated great success in a variety of NLP and CV tasks and have become a significant development in the field of deep learning. However, the large memory and high computational requirements associated with PTMs can increase the cost and time of inference, limiting their service provisioning in practical applications. To improve the Quality of Service (QoS) of PTM applications by reducing waiting and response times, we propose the FastPTM framework. This general framework aims to accelerate PTM inference services in a multi-tenant environment by reducing model loading time and switching overhead on GPUs. The framework utilizes a fast weights loading method based on weights and model separation of PTMs to efficiently accelerate parallel inference services in resource-constrained environments. Furthermore, an online scheduling algorithm is designed to reduce the inference service time. The results of the experiments indicate that FastPTM can improve the throughput of inference services by an average of 4x and up to 8.2x, while reducing the number of switches by 4.7x and the number of overtimes by 15.3x.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"122 ","pages":"Article 103114"},"PeriodicalIF":2.0,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142532380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Distributed consensus-based estimation of the leading eigenvalue of a non-negative irreducible matrix 基于分布式共识的非负不可还原矩阵前导特征值估算

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-10-05 DOI: 10.1016/j.parco.2024.103113

Rahim Alizadeh , Shahriar Bijani , Fatemeh Shakeri

引用次数: 0

Parallel Pattern Compiler for Automatic Global Optimizations 自动全局优化的并行模式编译器

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-09-21 DOI: 10.1016/j.parco.2024.103112

Adrian Schmitz, Semih Burak, Julian Miller, Matthias S. Müller

{"title":"Parallel Pattern Compiler for Automatic Global Optimizations","authors":"Adrian Schmitz, Semih Burak, Julian Miller, Matthias S. Müller","doi":"10.1016/j.parco.2024.103112","DOIUrl":"10.1016/j.parco.2024.103112","url":null,"abstract":"<div><div>High-performance computing (HPC) systems enable scientific advances through simulation and data processing. The heterogeneity in HPC hardware and software increases the application complexity and reduces its maintainability and productivity. This work proposes a prototype implementation for a parallel pattern-based source-to-source compiler to address these challenges. The prototype limits the complexity of parallelism and heterogeneous architectures to parallel patterns that are optimized towards a given target architecture. By applying high-level optimizations and a mapping between parallel patterns and execution units during compile time, portability between systems is achieved. The compiler can address architectures with shared memory, distributed memory, and accelerator offloading.</div><div>The approach shows speedups for seven of the nine supported Rodinia benchmarks, reaching speedups of up to twelve times. Porting LULESH to the Parallel Pattern Language (PPL) shows a compression of code size by 65% (3.4 thousand lines of code) through a more concise expression and a higher level of abstraction. The tool’s limitations include dynamic algorithms that are challenging to analyze statically and overheads during the compile time optimization. This paper is an extended version of a previous PMAM publication (Schmitz et al., 2024).</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"122 ","pages":"Article 103112"},"PeriodicalIF":2.0,"publicationDate":"2024-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142323330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Task scheduling in cloud computing based on grey wolf optimization with a new encoding mechanism 基于灰狼优化和新编码机制的云计算任务调度

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-09-17 DOI: 10.1016/j.parco.2024.103111

Xingwang Huang , Min Xie , Dong An , Shubin Su , Zongliang Zhang

引用次数: 0

An automated OpenMP mutation testing framework for performance optimization 用于性能优化的自动 OpenMP 突变测试框架

IF 2 4区计算机科学

Parallel Computing Pub Date : 2024-08-21 DOI: 10.1016/j.parco.2024.103097

Dolores Miao , Ignacio Laguna , Giorgis Georgakoudis , Konstantinos Parasyris , Cindy Rubio-González

{"title":"An automated OpenMP mutation testing framework for performance optimization","authors":"Dolores Miao , Ignacio Laguna , Giorgis Georgakoudis , Konstantinos Parasyris , Cindy Rubio-González","doi":"10.1016/j.parco.2024.103097","DOIUrl":"10.1016/j.parco.2024.103097","url":null,"abstract":"<div><p>Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level thus preventing their portability across compilers. This paper describes <span>Muppet</span>, a new approach that identifies program modifications called <em>mutations</em> aimed at improving program performance. <span>Muppet</span>’s mutations help developers reason about performance defects and missed opportunities to improve performance at the source code level. In contrast to compiler techniques that optimize code at intermediate representations (IR), <span>Muppet</span> uses the idea of source-level <em>mutation testing</em> to relax correctness constraints and automatically discover optimization opportunities that otherwise are not feasible using the IR. We demonstrate the <span>Muppet</span>’s concept in the OpenMP programming model. <span>Muppet</span> generates a list of OpenMP mutations that alter the program parallelism in various ways, and is capable of running a variety of optimization algorithms such as delta debugging, Bayesian Optimization and decision tree optimization to find a subset of mutations which, when applied to the original program, cause the most speedup while maintaining program correctness. When <span>Muppet</span> is evaluated against a diverse set of benchmark programs and proxy applications, it is capable of finding sets of mutations that induce speedup in 75.9% of the evaluated programs.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103097"},"PeriodicalIF":2.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819124000358/pdfft?md5=139743a6196b36bc64bd1733300112aa&pid=1-s2.0-S0167819124000358-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142040335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0