Parallel Computing: Latest Articles, Page 9

NekRS, a GPU-accelerated spectral element Navier–Stokes solver

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-12-01 DOI: 10.1016/j.parco.2022.102982

Paul Fischer, Stefan Kerkemeier, Misun Min, Yu-Hsiang Lan, Malachi Phillips, Thilina Rathnayake, Elia Merzari, Ananias Tomboulides, Ali Karakus, Noel Chalmers, Tim Warburton

Citations: 53

SGPM: A coroutine framework for transaction processing

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-12-01 DOI: 10.1016/j.parco.2022.102980

Xinyuan Wang, Hejiao Huang

Abstract: Coroutines can increase program concurrency and processor core utilization. However, existing coroutine packages have the following disadvantages when adapted to the coroutine-to-transaction model: (1) Additional scheduler threads incur synchronization overhead when the load between scheduler threads and worker threads is unbalanced. (2) Coroutines are swapped out periodically to prevent deadlocks, which increases the conflict rate by adding suspended transactions. (3) Supporting only swap-out functions (yield, await, etc.) does not allow flexible control over when a transaction is swapped back in. In this paper, we present SGPM, a coroutine framework for transaction processing. To adapt to the coroutine-to-transaction model, SGPM has the following properties: First, it eliminates scheduler threads and the periodic coroutine switch. Second, it provides a variety of coroutine scheduling strategies so that all types of concurrency control protocols can run on SGPM reasonably. We implement eight well-known concurrency control protocols on SGPM and, in particular, use SGPM to optimize the performance of four wound-wait protocols among them: 2PL, SS2PL, Calvin, and EWV. The experimental results demonstrate that, after SGPM optimization, 2PL and SS2PL outperform OCC and MVCC, and the throughput of Calvin and EWV is also improved by 1.2x and 1.3x, respectively.

Parallel Computing, Volume 114, Article 102980.

Citations: 0

Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-12-01 DOI: 10.1016/j.parco.2022.102973

Lukas Spies, Amanda Bienz, David Moulton, Luke Olson, Andrew Reisner

Abstract: Exchanging halo data is a common task in modern scientific computing applications, and efficient handling of this operation is critical for the performance of the overall simulation. Tausch is a novel header-only library that provides a simple API for efficiently handling these types of data movements. Tausch supports simple CPU-only systems as well as more complex heterogeneous systems with both CPUs and GPUs. It currently supports both OpenCL and CUDA for communicating with GPGPU devices, and allows for communication between GPGPUs and CPUs. The API allows for drop-in replacement in existing codes and can be used for the communication layer in new codes. This paper provides an overview of the approach taken in Tausch, and a performance analysis that demonstrates expected and achieved performance. We highlight the ease of use and performance with three applications: first, Tausch is compared to the halo exchange framework from two Mantevo applications, HPCCG and miniFE; then it is used to replace a legacy halo exchange library in the flexible multigrid solver framework Cedar.

Parallel Computing, Volume 114, Article 102973.
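To make the term "halo exchange" concrete, the following is a minimal single-process sketch in Python. It is not Tausch's API (Tausch is a C++ header-only library built on MPI, OpenCL, and CUDA); the function name `exchange_halos` and the list layout are hypothetical and chosen only for illustration. Each subdomain stores ghost cells on both sides of its interior, and the exchange copies each neighbour's edge values into those ghost cells.

```python
def exchange_halos(subdomains, halo=1):
    """Fill each subdomain's ghost cells from its neighbours' interior edges.

    Hypothetical layout: [left ghosts | interior | right ghosts], with `halo`
    ghost cells per side and an interior of at least `halo` cells.
    The outermost ghost cells (no neighbour) are left untouched.
    """
    for i, sub in enumerate(subdomains):
        if i > 0:
            left = subdomains[i - 1]
            # Left ghosts receive the last `halo` interior cells of the left neighbour.
            sub[:halo] = left[-2 * halo:-halo]
        if i < len(subdomains) - 1:
            right = subdomains[i + 1]
            # Right ghosts receive the first `halo` interior cells of the right neighbour.
            sub[-halo:] = right[halo:2 * halo]
    return subdomains


# Two 1D subdomains with interiors [1, 2, 3] and [4, 5, 6]; zeros mark ghost cells.
a = [0, 1, 2, 3, 0]
b = [0, 4, 5, 6, 0]
exchange_halos([a, b])
print(a, b)
```

Only ghost cells are written and only interior cells are read, so the order in which subdomains are visited does not matter. In a distributed code the two slice copies would become a send and a matching receive between ranks or devices.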

Citations: 1

Graph optimization algorithm using symmetry and host bias for low-latency indirect network

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-12-01 DOI: 10.1016/j.parco.2022.102983

Masahiro Nakao, Masaki Tsukamoto, Yoshiko Hanada, Keiji Yamamoto

Abstract: It is known that an indirect network with a small host-to-host Average Shortest Path Length (h-ASPL) improves overall system performance in a parallel computer system. As a means to discuss such indirect networks in graph theory, the Order/Radix Problem (ORP) has been proposed. ORP involves finding a graph with a minimum h-ASPL that satisfies a given number of hosts and radix. A graph in ORP represents an indirect network and has two types of vertices: host and switch. We propose an optimization algorithm to generate graphs with a sufficiently small h-ASPL. The primary features of the proposed algorithm are the symmetry of the graph and the bias of the hosts adjacent to each switch. These features reduce the computational time to calculate the h-ASPL and improve the search performance of the algorithm. The performance of the proposed algorithm is evaluated using problems presented by Graph Golf, an international ORP competition. Our results show that the proposed algorithm can generate graphs with a smaller h-ASPL than the existing algorithm. To evaluate the performance of the graphs generated by the proposed algorithm, we use the parallel simulation framework SimGrid and the parallel benchmark collection NAS Parallel Benchmarks. Our results also show that the graphs generated by the proposed algorithm have higher performance than those generated by the existing algorithm.

Parallel Computing, Volume 114, Article 102983.

PDF: https://www.sciencedirect.com/science/article/pii/S0167819122000722/pdfft?md5=70b6cbe2b73c6952541b7170b6406471&pid=1-s2.0-S0167819122000722-main.pdf
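The h-ASPL objective named in this abstract can be illustrated with a plain breadth-first search. This is a minimal sketch and not the authors' optimization algorithm (which, per the abstract, exploits symmetry to avoid this brute-force cost); the function name `h_aspl` and the adjacency-dictionary representation are assumptions made for illustration, and the graph is assumed to be connected with at least two hosts.

```python
from collections import deque


def h_aspl(adjacency, hosts):
    """Average shortest path length over all pairs of distinct hosts.

    adjacency: dict mapping every vertex (host or switch) to a list of neighbours.
    hosts: the vertices that count as hosts; all other vertices are switches.
    """
    hosts = list(hosts)
    total = 0
    for src in hosts:
        # Unweighted BFS from this host over the whole host-switch graph.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist[h] for h in hosts if h != src)
    # `total` counts every ordered pair, so divide by n * (n - 1).
    return total / (len(hosts) * (len(hosts) - 1))


# Two connected switches, each with two hosts attached.
graph = {
    "h0": ["s0"], "h1": ["s0"], "h2": ["s1"], "h3": ["s1"],
    "s0": ["h0", "h1", "s1"],
    "s1": ["h2", "h3", "s0"],
}
print(h_aspl(graph, ["h0", "h1", "h2", "h3"]))
```

In this example, hosts on the same switch are 2 hops apart and hosts on different switches are 3 hops apart, so the result is (2·2 + 4·3) / 6 = 8/3.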

Citations: 0

Operational Data Analytics in practice: Experiences from design to deployment in production HPC environments

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-10-01 DOI: 10.1016/j.parco.2022.102950

Alessio Netti, Michael Ott, Carla Guillen, Daniele Tafani, Martin Schulz

Abstract: As HPC systems continue to grow in scale and complexity, efficient and manageable operation is increasingly critical. For this reason, many centers are starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from the massive amounts of data produced by monitoring systems and use it for enacting control over system knobs, or for aiding administrators through visualization. As ODA is a multi-faceted problem, much research effort has gone into finding solutions to its separate aspects; however, comprehensive solutions to enable production use of ODA are still rare, while accounts of ODA experiences and the associated challenges are even harder to come across.

In this work we aim to bridge the gap between ODA research and production use by presenting our own experiences, associated with proactive control of warm-water inlet temperatures and visualization of job data on two different HPC systems. We cover the entire development process, starting from a description of requirements and challenges, through to design, deployment and evaluation. Moreover, we discuss a series of critical points related to the maintainability of ODA, and propose action items in an effort to drive the community forward. We rely on a series of open-source tools and techniques, which make for a generic ODA framework that is suitable for most use cases.

Parallel Computing, Volume 113, Article 102950.

Citations: 3

Accelerating communication for parallel programming models on GPU systems

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-10-01 DOI: 10.1016/j.parco.2022.102969

Jaemin Choi, Zane Fink, Sam White, Nitin Bhat, David F. Richards, Laxmikant V. Kale

Abstract: As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models of the Charm++ ecosystem: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining improvements in latency of up to 10.1x in Charm++, 11.7x in AMPI, and 17.4x in Charm4py. We also observe increases in bandwidth of up to 10.1x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating a proxy application for the Jacobi iterative method, improving the communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.

Parallel Computing, Volume 113, Article 102969.

Citations: 2

Optimizing small channel 3D convolution on GPU with tensor core

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-10-01 DOI: 10.1016/j.parco.2022.102954

Jiazhi Jiang, Dan Huang, Jiangsu Du, Yutong Lu, Xiangke Liao

Abstract: In many scenarios, particularly scientific AI applications, algorithm engineers widely adopt more complex convolutions, e.g. 3D CNNs, to improve accuracy. Scientific AI applications with 3D CNNs, which tend to train on volumetric datasets, substantially increase the size of the input, which in turn potentially restricts the channel sizes (e.g. to fewer than 64) under the constraints of limited device memory capacity. Since existing convolution implementations tend to split and parallelize the small channel convolution along the channel dimension, they usually cannot fully exploit the performance of GPU accelerators, in particular those configured with the emerging tensor cores.

In this work, we target enhancing the performance of small channel 3D convolution on GPU platforms configured with tensor cores. Our analysis shows that the channel size of a convolution has a great effect on the performance of existing convolution implementations, which are memory-bound on tensor cores. By leveraging the memory hierarchy characteristics and the WMMA API of tensor cores, we propose and implement holistic optimizations for both promoting data access efficiency and intensifying the utilization of computing units. Experiments show that our implementation obtains a 1.1x–5.4x speedup compared to cuDNN's implementations of 3D convolution on different GPU platforms. We also evaluate our implementations on two practical scientific AI applications and observe up to 1.7x and 2.0x overall speedups compared with using cuDNN on a V100 GPU.

Parallel Computing, Volume 113, Article 102954.

Citations: 4

Graph optimization algorithm using symmetry and host bias for low-latency indirect network

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-10-01 DOI: 10.2139/ssrn.4048955

M. Nakao, M. Tsukamoto, Y. Hanada, Keiji Yamamoto

Citations: 1

A method for efficient radio astronomical data gridding on multi-core vector processor

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-10-01 DOI: 10.1016/j.parco.2022.102972

Hao Wang, Ce Yu, Jian Xiao, Shanjiang Tang, Yu Lu, Hao Fu, Bo Kang, Gang Zheng, Chenzhou Cui

Abstract: Gridding is the performance-critical step in the data reduction pipeline for radio astronomy research, allowing astronomers to create the correct sky images for further analysis. Like 2D stencil computation, gridding iteratively updates the output cells by convolution, where the value at each output cell is computed as a weighted sum of neighboring point values. Existing state-of-the-art works have improved gridding performance by using multi-core CPUs and GPUs in real-world applications, and they showed that gridding is a type of scientific computation with high-density computing characteristics. However, low computational performance or high power consumption becomes the main limitation when they process large-scale astronomical data. The high-density computing feature of gridding provides opportunities to accelerate it on multi-core vector processors with vector-SIMD architectures. However, the task parallelization and data transfer strategies of existing works (such as those implemented on CPUs or GPUs) are inefficient for performing gridding directly on a vector processor without a dedicated mapping algorithm.

M-DSP is a multi-core vector processor with vector-SIMD architectures designed for the next-generation exascale supercomputer, delivering high performance with ultra-low power consumption. In this paper, we present, for the first time, a novel method to achieve efficient gridding on the M-DSP. Specifically, we propose a gridding workflow designed for the vector-SIMD architectures and present a vectorized version of the gridding convolution algorithm to fully exploit the computational power of the M-DSP. In addition, centering on the processor architecture, we propose task-based parallelization strategies for block and line computing as well as different data loading strategies to achieve high parallel performance and high data transfer efficiency. Experimental results show that our work on M-DSP exhibits very competitive performance compared to other methods running on CPUs or GPUs. This demonstrates the efficiency of our method and the fact that the vector-SIMD architecture is beneficial for scientific computing with "high-density" characteristics, which can exploit its wide vector core and achieve higher performance than its competitors.

Parallel Computing, Volume 113, Article 102972.
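The abstract describes gridding as computing each output cell as a weighted sum of neighbouring point values. The following is a minimal scalar sketch of that idea in Python, not the vectorized M-DSP method from the paper. The function name `grid_samples`, the Gaussian kernel, the default `radius` and `sigma`, and the normalization by the sum of weights are all assumptions chosen for illustration.

```python
import math


def grid_samples(samples, width, height, radius=1.5, sigma=0.7):
    """Convolve scattered (x, y, value) samples onto a width x height grid.

    Each cell becomes the kernel-weighted average of the samples that lie
    within `radius` of the cell centre; cells with no contributing sample are NaN.
    """
    acc = [[0.0] * width for _ in range(height)]   # sum of weight * value
    wsum = [[0.0] * width for _ in range(height)]  # sum of weights
    for x, y, value in samples:
        # Visit only the cells inside the kernel's bounding box.
        j_lo = max(0, math.ceil(y - radius))
        j_hi = min(height, math.floor(y + radius) + 1)
        i_lo = max(0, math.ceil(x - radius))
        i_hi = min(width, math.floor(x + radius) + 1)
        for j in range(j_lo, j_hi):
            for i in range(i_lo, i_hi):
                d2 = (i - x) ** 2 + (j - y) ** 2
                if d2 > radius * radius:
                    continue  # outside the circular kernel support
                w = math.exp(-d2 / (2.0 * sigma * sigma))  # assumed Gaussian kernel
                acc[j][i] += w * value
                wsum[j][i] += w
    return [
        [acc[j][i] / wsum[j][i] if wsum[j][i] > 0.0 else math.nan
         for i in range(width)]
        for j in range(height)
    ]


# Two samples on a 3 x 1 grid: the middle cell is equidistant from both.
print(grid_samples([(0.0, 0.0, 2.0), (2.0, 0.0, 4.0)], width=3, height=1))
```

This sketch scatters each sample to nearby cells, which keeps the loop bounds simple. A vector-SIMD implementation would typically reorganize the inner loops so that many cells or samples are processed per instruction.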

Citations: 0

QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU

IF 1.4 · Zone 4, Computer Science

Parallel Computing Pub Date: 2022-10-01 DOI: 10.1016/j.parco.2022.102958

Qingxiao Sun, Liu Yi, Hailong Yang, Mingzhen Li, Zhongzhi Luan, Depei Qian

Abstract: Although GPUs have become indispensable in data centers, meeting Quality of Service (QoS) targets under task consolidation on GPUs is extremely challenging. Previous works mostly rely on static task or resource scheduling and cannot handle QoS violations during runtime. In addition, existing works fail to exploit the computing characteristics of batch tasks, and thus waste opportunities to reduce power consumption while improving GPU utilization. To address the above problems, we propose a new runtime mechanism, SMQoS, that can dynamically adjust the resource allocation during runtime to meet the QoS of latency-sensitive (LS) tasks and determine the optimal resource allocation for batch tasks to improve GPU utilization and power efficiency. We implement the proposed mechanism on both a simulator (SMQoS) and real GPU hardware (RH-SMQoS). The experimental results show that both SMQoS and RH-SMQoS achieve better QoS for LS tasks and higher throughput for batch tasks compared to state-of-the-art works. With a hardware extension, SMQoS can further reduce power consumption by power gating idle computing resources.

Parallel Computing, Volume 113, Article 102958.

Citations: 1