CUDAMicroBench: Microbenchmarks to Assist CUDA Performance Programming

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2021-06-01. DOI: 10.1109/IPDPSW52791.2021.00068

Xinyao Yi, D. Stokes, Yonghong Yan, C. Liao



Abstract

Programming NVIDIA GPUs with CUDA to achieve high performance is known to be challenging. A GPU has hundreds or thousands of cores, so a program must exhibit sufficient parallelism to achieve maximum GPU utilization. A system with GPU accelerators has a heterogeneous and deep memory system that programmers must use effectively and correctly to take full advantage of the GPU’s parallelism. In this paper, we present CUDAMicroBench, a collection of fourteen microbenchmarks that demonstrate performance challenges in CUDA programming and techniques for optimizing CUDA programs to address these challenges. It also includes examples and techniques for using advanced CUDA features, such as data shuffling between threads and dynamic parallelism, that can help users optimize CUDA programs for performance. The microbenchmarks can be used to evaluate the performance of GPU architectures, the memory systems of the GPU itself and of the whole system architecture, and the effectiveness of compilers and performance tools for performance analysis. They can also help users understand the complexity of heterogeneous GPU-accelerator systems through examples and guide users in performance optimization. CUDAMicroBench is released as BSD-licensed open source at https://github.com/passlab/CUDAMicroBench.git.
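To make "data shuffling between threads" concrete, the following is a minimal sketch of a warp-level sum using the CUDA intrinsic `__shfl_down_sync`. It is not taken from the CUDAMicroBench repository; the names `warpReduceSum`, `sumKernel`, and `sumOnGpu`, the block size of 256, and the use of unified memory are all assumptions chosen for illustration. It assumes CUDA 9 or later and a block size that is a multiple of the warp size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: sums one value per thread across a warp by passing
// registers between lanes with shuffles, so no shared memory is needed.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        // Each lane adds the value held by the lane `offset` positions above it.
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;  // only lane 0 is guaranteed to hold the full warp sum
}

// Hypothetical kernel: each warp reduces its 32 inputs, then lane 0 of the
// warp adds the partial sum to a single global accumulator.
__global__ void sumKernel(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads contribute 0 but still take part in the shuffle,
    // which keeps the full-warp mask above valid.
    float val = (idx < n) ? in[idx] : 0.0f;
    val = warpReduceSum(val);
    if ((threadIdx.x & (warpSize - 1)) == 0) {
        atomicAdd(out, val);
    }
}

// Hypothetical host wrapper: fills an array with ones and sums it on the GPU.
float sumOnGpu(int n) {
    float *in = nullptr;
    float *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    const int threads = 256;  // assumed block size, a multiple of warpSize
    const int blocks = (n + threads - 1) / threads;
    sumKernel<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    float result = *out;
    cudaFree(in);
    cudaFree(out);
    return result;
}

int main() {
    const int n = 1 << 20;
    printf("sum = %.0f, n = %d\n", sumOnGpu(n), n);
    return 0;
}
```

Built with `nvcc`, the sketch sums an array of ones, so the result should equal the element count. A shared-memory reduction would produce the same sum; comparing the two variants is the kind of measurement a microbenchmark can make.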

