Parallel Computing: Latest Publications

A GPU-based hydrodynamic simulator with boid interactions
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 119, Article 103062. Pub Date: 2023-12-21. DOI: 10.1016/j.parco.2023.103062
Xi Liu, Gizem Kayar, Ken Perlin
Abstract: We present a hydrodynamic simulation system that uses the GPU compute shaders of DirectX to simulate virtual agent behaviors and navigation inside a smoothed particle hydrodynamics (SPH) fluid environment with real-time water mesh surface reconstruction. The current SPH literature includes interactions between SPH and heterogeneous meshes but seldom involves interactions between SPH and virtual boid agents. The contribution of the system lies in combining the parallel smoothed particle hydrodynamics model with the distributed boid model of virtual agents, so that agents can interact with fluids. The agents based on the boid algorithm influence the motion of SPH fluid particles, and the forces from the SPH algorithm affect the movement of the boids. To enable realistic fluid rendering and simulation in a particle-based system, it is essential to construct a mesh from the particle attributes. Our system also contributes to the surface reconstruction stage of the pipeline: we performed a set of experiments with a per-frame parallel marching cubes algorithm that constructs the mesh from the fluid particles in a real-time, compute- and memory-intensive application, producing a wide range of triangle configurations. We also demonstrate that our system is versatile enough for reinforced robotic agents, instead of boid agents, to interact with the fluid environment for underwater navigation and remote control engineering purposes.
Open Access PDF: https://www.sciencedirect.com/science/article/pii/S0167819123000686/pdfft?md5=c561b22916df38cc210c4a6988c337bc&pid=1-s2.0-S0167819123000686-main.pdf
Citations: 0
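The coupling described in the abstract is symmetric: boid steering forces perturb nearby SPH particles, and fluid forces feed back into the boids. The sketch below illustrates that two-way exchange on the CPU with a simple linear falloff kernel; the types, the couple_forces routine, the smoothing radius h, and the coupling constant are all illustrative assumptions, not the authors' DirectX compute-shader implementation.

```cpp
// Illustrative CPU sketch of two-way SPH/boid force coupling (assumed design,
// not the paper's GPU compute-shader code).
#include <cmath>
#include <vector>

struct Vec3 { float x = 0, y = 0, z = 0; };

static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float length(Vec3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

struct Particle { Vec3 pos, vel, force; };  // one SPH fluid particle
struct Boid     { Vec3 pos, vel, force; };  // one flocking agent

// Couple boids and fluid particles inside a smoothing radius h: each boid
// pushes nearby particles along its velocity, and the fluid pushes back on
// the boid with the opposite (reaction) force.
void couple_forces(std::vector<Particle>& fluid, std::vector<Boid>& boids,
                   float h, float coupling) {
    for (auto& b : boids) {
        for (auto& p : fluid) {
            Vec3 d = p.pos - b.pos;
            float r = length(d);
            if (r >= h) continue;                // outside interaction radius
            float w = 1.0f - r / h;              // simple linear falloff kernel
            Vec3 push = b.vel * (coupling * w);  // boid drags fluid with it
            p.force = p.force + push;
            b.force = b.force - push;            // equal and opposite reaction
        }
    }
}

int main() {
    std::vector<Particle> fluid(3);
    std::vector<Boid> boids(1);
    boids[0].vel = {1.0f, 0.0f, 0.0f};
    couple_forces(fluid, boids, /*h=*/0.5f, /*coupling=*/0.1f);
    return 0;
}
```

In the actual system this pairwise loop would run per frame on the GPU, typically with a spatial hash grid so each boid only visits particles in neighboring cells.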
Program partitioning and deadlock analysis for MPI based on logical clocks
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 119, Article 103061. Pub Date: 2023-12-04. DOI: 10.1016/j.parco.2023.103061
Shushan Li, Meng Wang, Hong Zhang, Yao Liu
Abstract: The Message Passing Interface (MPI) has become a standard programming model in the field of high-performance computing. Ensuring the reliability of MPI programs by detecting whether they contain errors is therefore of great importance. However, deadlock, one of the most common errors in MPI programs, is difficult to detect because of the non-determinism and asynchronous communication supported by MPI. Existing approaches mainly detect deadlocks by traversing all possible execution paths of an MPI program, but the detection efficiency of this strategy is limited because the number of execution paths grows exponentially with the number of wildcard receives and processes in the program.
To alleviate the path explosion problem for single-path MPI programs, we propose a program partitioning approach based on logical clocks for detecting deadlocks. The program is first divided into several preliminary partitions based on the matching detection rule. To obtain the dependency relationships between partitions, a Binary Lazy Clocks algorithm is proposed to assign clocks to communication operations. Based on these clocks, the completion orders of the communication operations in each process are tracked. We then derive the dependency relationships of the preliminary partitions by analyzing these completion orders and merge dependent preliminary partitions to generate the final partitions. Finally, deadlocks are detected by traversing all possible execution paths of each final partition. We have implemented our method in a tool called PDMPI and evaluated it on 14 programs. The experimental results indicate that PDMPI detects deadlocks in MPI programs more effectively than the two most closely related tools, ISP and SAMPI, especially for programs with numerous interleavings.
Open Access PDF: https://www.sciencedirect.com/science/article/pii/S0167819123000674/pdfft?md5=544d7a7d482400a8b6dab8a9d68a3fba&pid=1-s2.0-S0167819123000674-main.pdf
Citations: 0
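The path explosion the abstract refers to comes from wildcard receives, whose matching order is not fixed by the program. The minimal example below (written for illustration, not taken from the paper or from PDMPI) shows how a single MPI_ANY_SOURCE receive creates two execution paths, one of which deadlocks; a verifier such as ISP, SAMPI, or PDMPI has to account for both match orders.

```cpp
// Minimal illustration of wildcard-receive nondeterminism leading to deadlock.
// Run with three processes, e.g.: mpirun -np 3 ./wildcard_deadlock
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = rank;
    if (rank == 0) {
        int a, b;
        // Wildcard receive: may match the message from rank 1 *or* rank 2.
        MPI_Recv(&a, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Deterministic receive from rank 2: if the wildcard above already
        // consumed rank 2's only message, this call is never matched and
        // the program deadlocks on that execution path.
        MPI_Recv(&b, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1 || rank == 2) {
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Compiled with an MPI wrapper (for example mpicxx) and run with three processes, the program either finishes or hangs depending on which send the wildcard receive matches, which is exactly the kind of interleaving a deadlock detector must enumerate.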
OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 118, Article 103053. Pub Date: 2023-11-01. DOI: 10.1016/j.parco.2023.103053
Yunqi Gao, Zechao Zhang, Bing Hu, A-Long Jin, Chunming Wu
Abstract: The communication bottleneck has severely restricted the scalability of distributed deep learning. Tensor fusion improves the scalability of data parallelism by overlapping computation and communication tasks. However, existing tensor fusion schemes result in only suboptimal training performance. In this paper, we propose an efficient communication mechanism (OF-WFBP) to find the optimal tensor fusion scheme for synchronous data parallelism. We present the mathematical model of OF-WFBP and prove that it is an NP-hard problem. We solve the model mathematically in two cases and propose an improved sparrow search algorithm (GradSSA) to find near-optimal tensor fusion schemes efficiently in the other cases. Experimental results on two different GPU clusters show that OF-WFBP achieves up to 1.43x speedup compared to state-of-the-art tensor fusion mechanisms.
Citations: 0
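Tensor fusion packs the many small gradient tensors produced during back-propagation into larger buffers before communication, so that each collective call amortizes its fixed startup cost; choosing the bucket boundaries is the optimization problem OF-WFBP formulates. The sketch below shows a simple greedy size-threshold bucketing purely for intuition; the Tensor struct, the fuse function, and the threshold are assumptions for illustration and are not the OF-WFBP or GradSSA algorithm.

```cpp
// Illustrative greedy tensor-fusion sketch (assumed baseline bucketing, not
// OF-WFBP itself): gradients produced in back-prop order are packed into
// buffers of at most fusion_bytes before being communicated.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Tensor {
    std::string name;
    std::size_t bytes;
};

// Group tensors into fused buckets no larger than fusion_bytes.
std::vector<std::vector<Tensor>> fuse(const std::vector<Tensor>& grads,
                                      std::size_t fusion_bytes) {
    std::vector<std::vector<Tensor>> buckets;
    std::size_t current = 0;
    for (const Tensor& t : grads) {
        if (buckets.empty() || current + t.bytes > fusion_bytes) {
            buckets.push_back({});   // start a new fused buffer
            current = 0;
        }
        buckets.back().push_back(t);
        current += t.bytes;
    }
    return buckets;
}

int main() {
    std::vector<Tensor> grads = {{"fc2", 4 << 20}, {"fc1", 16 << 20},
                                 {"conv3", 2 << 20}, {"conv2", 2 << 20},
                                 {"conv1", 1 << 20}};
    for (const auto& bucket : fuse(grads, /*fusion_bytes=*/8u << 20)) {
        std::size_t total = 0;
        for (const auto& t : bucket) total += t.bytes;
        std::cout << "allreduce one fused buffer of " << (total >> 20) << " MiB\n";
    }
    return 0;
}
```

A fixed threshold like this is exactly what the paper argues is suboptimal: too small and startup latency dominates, too large and communication can no longer overlap with the remaining backward computation.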
Low consumption automatic discovery protocol for DDS-based large-scale distributed parallel computing
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 118, Article 103052. Pub Date: 2023-11-01. DOI: 10.1016/j.parco.2023.103052
Zhexu Liu, Shaofeng Liu, Zhiyong Fan, Zhen Zhao
Abstract: DDS (Data Distribution Service) is an efficient communication specification for distributed parallel computing. However, as the scale of computation expands, high network load and memory consumption consistently limit its performance. This paper proposes a low-consumption automatic discovery protocol to improve DDS in large-scale distributed parallel computing. First, an improved Bloom filter called TBF (Threshold Bloom Filter) is presented to compress the data topic; it is then combined with the SDP (Simple Discovery Protocol) to reduce the cost of the automatic discovery process in DDS. On this basis, data publications and subscriptions between the distributed computing nodes are matched using a binarization threshold θ and a decision threshold T, which are obtained through iterative optimization algorithms. Experimental results show that SDPTBF guarantees higher transmission accuracy while reducing network load and memory consumption, and therefore improves the performance of DDS-based large-scale distributed parallel computing.
Citations: 0
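The abstract does not spell out the internals of the TBF, so the sketch below is only one plausible reading: counters are binarized with the threshold θ, and a publication/subscription match is declared when the fraction of set positions among the k probes reaches the decision threshold T. The filter size, hash scheme, and all names are assumptions; the real TBF/SDPTBF design may differ.

```cpp
// Sketch of a threshold-style Bloom filter for topic matching (one plausible
// interpretation of the abstract, not the published TBF design).
#include <array>
#include <cstdint>
#include <functional>
#include <string>

constexpr std::size_t kBits = 256;   // filter size (illustrative)
constexpr int kHashes = 4;           // number of hash functions (illustrative)

struct ThresholdBloomFilter {
    std::array<uint8_t, kBits> counters{};

    static std::size_t slot(const std::string& topic, int i) {
        // Derive k probe positions by salting the topic string.
        return std::hash<std::string>{}(topic + '#' + std::to_string(i)) % kBits;
    }

    void add(const std::string& topic) {
        for (int i = 0; i < kHashes; ++i) {
            uint8_t& c = counters[slot(topic, i)];
            if (c < 255) ++c;
        }
    }

    // Binarize counters with theta, then accept if the fraction of set
    // positions among the k probes reaches the decision threshold T.
    bool maybe_contains(const std::string& topic, uint8_t theta, double T) const {
        int hits = 0;
        for (int i = 0; i < kHashes; ++i)
            if (counters[slot(topic, i)] >= theta) ++hits;
        return static_cast<double>(hits) / kHashes >= T;
    }
};

int main() {
    ThresholdBloomFilter tbf;
    tbf.add("sensor/temperature");
    bool match = tbf.maybe_contains("sensor/temperature", /*theta=*/1, /*T=*/1.0);
    return match ? 0 : 1;
}
```

Whatever the exact construction, the point of compressing topics this way is that discovery messages carry a small fixed-size filter instead of full topic lists, trading a tunable false-positive rate for lower network load and memory use.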
Targeting performance and user-friendliness: GPU-accelerated finite element computation with automated code generation in FEniCS
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 118, Article 103051. Pub Date: 2023-10-06. DOI: 10.1016/j.parco.2023.103051
James D. Trotter, Johannes Langguth, Xing Cai
Abstract: This paper studies the use of automated code generation to provide user-friendly GPU acceleration for solving partial differential equations (PDEs) with finite element methods. By extending the FEniCS framework and its automated compiler, we achieve auto-translation of a high-level description of finite element computations written in the Unified Form Language into parallelised CUDA C++ code. The auto-generated code provides GPU offloading for the finite element assembly of linear equation systems, which are then solved by a GPU-supported linear algebra backend.
Specifically, we explore several auto-generated optimisations of the resulting CUDA C++ code. Numerical experiments show that GPU-based linear system assembly for a typical PDE with first-order elements can benefit from using a lookup table to avoid repeatedly carrying out numerous binary searches, and that further performance gains can be obtained by assembling a sparse matrix row by row. More importantly, the extended FEniCS compiler is able to seamlessly couple the assembly and solution phases for GPU acceleration, so that all unnecessary CPU–GPU data transfers are eliminated. Detailed experiments quantify the negative impact of these data transfers, which can entirely destroy the potential of GPU acceleration if the assembly and solution phases are offloaded to the GPU separately. Finally, a complete, auto-generated GPU-based PDE solver for a nonlinear solid mechanics application demonstrates a substantial speedup over running on dual-socket multi-core CPUs, including GPU acceleration of algebraic multigrid as the preconditioner.
Citations: 0
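The lookup-table optimisation mentioned in the abstract replaces the per-entry binary search into the CSR sparsity pattern with a precomputed map from each element's local (i, j) pair to its position in the values array, turning assembly into a direct scatter. The CPU sketch below illustrates that idea under assumed names and data layout; it is not the CUDA C++ code generated by the extended FEniCS compiler.

```cpp
// CPU sketch of the assembly lookup-table idea: binary-search positions once,
// then reuse them every time the matrix is (re)assembled.
#include <algorithm>
#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::vector<std::size_t> row_ptr;  // size n_rows + 1
    std::vector<std::size_t> col_idx;  // column indices, sorted within each row
    std::vector<double> values;
};

// Binary search for the storage position of entry (row, col).
std::size_t find_position(const CsrMatrix& A, std::size_t row, std::size_t col) {
    auto first = A.col_idx.begin() + A.row_ptr[row];
    auto last = A.col_idx.begin() + A.row_ptr[row + 1];
    return static_cast<std::size_t>(std::lower_bound(first, last, col) - A.col_idx.begin());
}

// Precompute, for every element and every local (i, j) pair, where the
// contribution lands in A.values. Done once; reused in every assembly pass.
std::vector<std::size_t> build_lookup(const CsrMatrix& A,
                                      const std::vector<std::vector<std::size_t>>& element_dofs) {
    std::vector<std::size_t> lookup;
    for (const auto& dofs : element_dofs)
        for (std::size_t i : dofs)
            for (std::size_t j : dofs)
                lookup.push_back(find_position(A, i, j));
    return lookup;
}

int main() {
    // 2x2 matrix with all four entries stored; one element touching dofs 0 and 1.
    CsrMatrix A{{0, 2, 4}, {0, 1, 0, 1}, {0, 0, 0, 0}};
    std::vector<std::vector<std::size_t>> element_dofs = {{0, 1}};
    std::vector<std::size_t> lookup = build_lookup(A, element_dofs);

    // Assembly is now a direct scatter through the table (atomic adds on a GPU).
    std::vector<double> Ae = {4, -1, -1, 4};   // illustrative element matrix
    for (std::size_t k = 0; k < Ae.size(); ++k) A.values[lookup[k]] += Ae[k];
    return 0;
}
```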
Task graph-based performance analysis of parallel-in-time methods
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 118, Article 103050. Pub Date: 2023-09-14. DOI: 10.1016/j.parco.2023.103050
Matthias Bolten, Stephanie Friedhoff, Jens Hahne
Abstract: In this paper, we present a performance model based on task graphs for various iterative parallel-in-time (PinT) methods. PinT methods have been developed to speed up the simulation of time-dependent problems on modern parallel supercomputers. The performance model is based on a data-driven notation of the methods, from which a task graph is generated. Based on this task graph and a distribution of time points across processes typical for PinT methods, a theoretical lower runtime bound for the method can be obtained, as well as a prediction of the runtime for a given number of processes. In particular, the model covers the large parameter space of PinT methods and makes predictions for arbitrary parameter settings. We describe a general procedure for generating task graphs for three iterative PinT methods, namely Parareal, multigrid-reduction-in-time (MGRIT), and the parallel full approximation scheme in space and time (PFASST), and discuss how these task graphs can be used to analyze the performance of the methods. In addition, we compare the predictions of the model with parallel simulation times obtained using five different PinT libraries.
Citations: 0
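A theoretical lower runtime bound of the kind described above can be read off a task graph as the longest weighted path through the DAG, since no schedule can finish before its critical path. The sketch below computes that bound for a made-up two-time-point graph; the task costs and graph shape are illustrative and do not reproduce the paper's models of Parareal, MGRIT, or PFASST.

```cpp
// Minimal critical-path computation on a task graph with illustrative costs.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Task {
    double cost;                      // runtime of this task
    std::vector<std::size_t> deps;    // indices of tasks that must finish first
};

// Tasks are assumed to be listed in topological order (deps point backwards).
double critical_path(const std::vector<Task>& graph) {
    std::vector<double> finish(graph.size(), 0.0);
    double bound = 0.0;
    for (std::size_t i = 0; i < graph.size(); ++i) {
        double start = 0.0;
        for (std::size_t d : graph[i].deps)
            start = std::max(start, finish[d]);
        finish[i] = start + graph[i].cost;
        bound = std::max(bound, finish[i]);
    }
    return bound;
}

int main() {
    // Tiny example: coarse sweep -> two fine sweeps in parallel -> update.
    std::vector<Task> graph = {
        {1.0, {}},       // 0: coarse propagation
        {4.0, {0}},      // 1: fine propagation, time point 1
        {4.0, {0}},      // 2: fine propagation, time point 2
        {0.5, {1, 2}},   // 3: correction/update step
    };
    std::cout << "lower runtime bound: " << critical_path(graph) << "\n";  // 5.5
    return 0;
}
```

Runtime prediction for a fixed process count additionally needs a mapping of tasks to processes and communication costs on the graph edges, which is where the paper's model goes beyond this bare critical-path bound.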
Distributed software defined network-based fog to fog collaboration scheme
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 117, Article 103040. Pub Date: 2023-09-01. DOI: 10.1016/j.parco.2023.103040
Muhammad Kabeer, Ibrahim Yusuf, Nasir Ahmad Sufi
Abstract: Fog computing was created to supplement the cloud in bridging the communication delay gap by deploying fog nodes nearer to Internet of Things (IoT) devices. Depending on geographical location, computational resources, and the rate of IoT requests, fog nodes can be idle or saturated. The latter case requires a special mechanism that enables collaboration with other nodes through service offloading to improve resource utilization. Software Defined Networking (SDN) offers improved bandwidth, latency, and awareness of the network topology, which has recently attracted researchers' attention and delivers promising results for service offloading. In this study, a hierarchical Distributed Software Defined Network-based (DSDN) fog to fog collaboration model is proposed; the scheme considers computational resources, such as the available CPU, and network resources, such as the communication hops to a prospective offloading node. Because fog nodes have limited resources and demand for fog services is projected to be high in the near future, the model also accounts for extreme cases in which all nearby nodes in a fog domain are saturated, employing a supervisor controller to scale the collaboration to other domains. The results of simulations carried out on Mininet show that the proposed multi-controller DSDN solution outperforms the traditional single-controller SDN solution; they further demonstrate that increasing the number of fog nodes does not significantly affect service offloading performance when multiple controllers are used.
Citations: 0
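A saturated fog node has to pick an offloading target from its domain using both computational resources (available CPU) and network resources (communication hops), escalating to the supervisor controller when the whole domain is saturated. The sketch below encodes one such selection rule; the FogNode fields, the scoring order, and the escalation signal are assumptions for illustration, not the paper's exact DSDN algorithm.

```cpp
// Illustrative offloading-target selection inside one fog domain.
#include <optional>
#include <vector>

struct FogNode {
    int id;
    double free_cpu;   // available CPU capacity (e.g. fraction of cores)
    int hops;          // communication hops from the saturated node
};

// Returns the chosen node id, or std::nullopt if the whole domain is
// saturated and the request should be escalated to the supervisor controller.
std::optional<int> pick_offload_target(const std::vector<FogNode>& domain,
                                       double required_cpu) {
    const FogNode* best = nullptr;
    for (const FogNode& n : domain) {
        if (n.free_cpu < required_cpu) continue;          // node is saturated
        if (!best || n.hops < best->hops ||
            (n.hops == best->hops && n.free_cpu > best->free_cpu))
            best = &n;                                    // fewer hops wins, then more CPU
    }
    if (!best) return std::nullopt;                       // escalate to another domain
    return best->id;
}

int main() {
    std::vector<FogNode> domain = {{1, 0.1, 1}, {2, 0.6, 2}, {3, 0.7, 3}};
    auto target = pick_offload_target(domain, 0.5);       // picks node 2
    return target ? 0 : 1;
}
```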
Optimizing massively parallel sparse matrix computing on ARM many-core processor
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 117, Article 103035. Pub Date: 2023-09-01. DOI: 10.1016/j.parco.2023.103035
Jiang Zheng, Jiazhi Jiang, Jiangsu Du, Dan Huang, Yutong Lu
Abstract: Sparse matrix multiplication is ubiquitous in many applications such as graph processing and numerical simulation. In recent years, numerous efficient sparse matrix multiplication algorithms and computational libraries have been proposed. However, most of them are oriented toward x86 or GPU platforms, while optimization on ARM many-core platforms has not been well investigated. Our experiments show that existing sparse matrix multiplication libraries for ARM many-core CPUs cannot achieve the expected parallel performance. Compared with traditional multi-core CPUs, an ARM many-core CPU has far more cores and often adopts NUMA techniques to scale the memory bandwidth; its parallel efficiency tends to be restricted by the NUMA configuration, memory bandwidth, cache contention, and other factors.
In this paper, we propose optimized implementations for sparse matrix computing on an ARM many-core CPU. We propose various optimization techniques for several sparse matrix multiplication routines to ensure coalesced access to matrix elements in memory. In detail, the optimization techniques include a fine-tuned CSR-based format for the ARM architecture and a co-optimization of Gustavson's algorithm with a hierarchical cache and a dense array strategy to mitigate the performance loss caused by handling compressed storage formats. We exploit a coarse-grained NUMA-aware strategy for inter-node parallelism and a fine-grained cache-aware strategy for intra-node parallelism to improve the parallel efficiency of sparse matrix multiplication. The evaluation shows that our implementation consistently outperforms the existing library on an ARM many-core processor.
Citations: 0
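Gustavson's algorithm builds each row of the product C = A·B by scattering scaled rows of B into an accumulator, and the "dense array strategy" mentioned above keeps that accumulator as a dense array indexed by column. The serial sketch below shows this row-by-row kernel, which is the unit of work that the paper's NUMA-aware (inter-node) and cache-aware (intra-node) strategies distribute across cores; it is a generic illustration, not the authors' ARM-tuned implementation.

```cpp
// Serial Gustavson-style sparse matrix multiplication with a dense accumulator.
#include <cstddef>
#include <vector>

struct Csr {
    std::size_t n_rows = 0, n_cols = 0;
    std::vector<std::size_t> row_ptr, col_idx;
    std::vector<double> values;
};

Csr spgemm_gustavson(const Csr& A, const Csr& B) {
    Csr C;
    C.n_rows = A.n_rows;
    C.n_cols = B.n_cols;
    C.row_ptr.push_back(0);
    std::vector<double> acc(B.n_cols, 0.0);   // dense accumulator for one row of C
    std::vector<bool> used(B.n_cols, false);

    for (std::size_t i = 0; i < A.n_rows; ++i) {
        std::vector<std::size_t> nz;          // columns touched in this row
        for (std::size_t ka = A.row_ptr[i]; ka < A.row_ptr[i + 1]; ++ka) {
            std::size_t k = A.col_idx[ka];
            double a = A.values[ka];
            // Scatter a * B(k, :) into the dense accumulator.
            for (std::size_t kb = B.row_ptr[k]; kb < B.row_ptr[k + 1]; ++kb) {
                std::size_t j = B.col_idx[kb];
                if (!used[j]) { used[j] = true; nz.push_back(j); }
                acc[j] += a * B.values[kb];
            }
        }
        // Gather the accumulated row into CSR and reset the accumulator.
        for (std::size_t j : nz) {
            C.col_idx.push_back(j);
            C.values.push_back(acc[j]);
            acc[j] = 0.0;
            used[j] = false;
        }
        C.row_ptr.push_back(C.col_idx.size());
    }
    return C;
}

int main() {
    // 2x2 identity times itself.
    Csr I{2, 2, {0, 1, 2}, {0, 1}, {1.0, 1.0}};
    Csr C = spgemm_gustavson(I, I);
    return (C.values.size() == 2) ? 0 : 1;
}
```

Because each output row is independent, rows can be assigned to threads in NUMA-node-sized blocks, with each thread keeping its own accumulator sized to fit in the local cache hierarchy.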
Editorial on Advances in High Performance Programming
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Volume 117, Article 103037. Pub Date: 2023-09-01. DOI: 10.1016/j.parco.2023.103037
A. Marowka, Przemysław Stpiczyński
Citations: 0
Parallelizable efficient large order multiple recursive generators
IF 1.4 · Computer Science (CAS Q4)
Parallel Computing, Article 103036. Pub Date: 2023-09-01. DOI: 10.2139/ssrn.4344139
L. Deng, Bryan R. Winter, J. H. Shiau, Henry Horng-Shing Lu, Nirman Kumar, Ching-Chi Yang
Citations: 0