高性能的便携式OpenMP

Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction Pub Date : 2022-03-18 DOI:10.1145/3497776.3517780

Guray Ozen, M. Wolfe

{"title":"高性能的便携式OpenMP","authors":"Guray Ozen, M. Wolfe","doi":"10.1145/3497776.3517780","DOIUrl":null,"url":null,"abstract":"Accelerated computing has increased the need to specialize how a program is parallelized depending on the target. Fully exploiting a highly parallel accelerator, such as a GPU, demands more parallelism and sometimes more levels of parallelism than a multicore CPU. OpenMP has a directive for each level of parallelism, but choosing directives for each target can incur a significant productivity cost. We argue that using the new OpenMP loop directive with an appropriate compiler decision process can achieve the same performance benefits of target-specific parallelization with the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. We provide an implementation of our proposal in NVIDIA's HPC compiler. It yields up to 56X speedup and an average of 1.91x-1.79x speedup compared to the baseline performance (depending on the host system) on GPUs, and preserves CPU performance. In addition, our proposal requires 60% fewer parallelism directives.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"235 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Performant portable OpenMP\",\"authors\":\"Guray Ozen, M. Wolfe\",\"doi\":\"10.1145/3497776.3517780\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accelerated computing has increased the need to specialize how a program is parallelized depending on the target. Fully exploiting a highly parallel accelerator, such as a GPU, demands more parallelism and sometimes more levels of parallelism than a multicore CPU. OpenMP has a directive for each level of parallelism, but choosing directives for each target can incur a significant productivity cost. We argue that using the new OpenMP loop directive with an appropriate compiler decision process can achieve the same performance benefits of target-specific parallelization with the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. We provide an implementation of our proposal in NVIDIA's HPC compiler. It yields up to 56X speedup and an average of 1.91x-1.79x speedup compared to the baseline performance (depending on the host system) on GPUs, and preserves CPU performance. In addition, our proposal requires 60% fewer parallelism directives.\",\"PeriodicalId\":333281,\"journal\":{\"name\":\"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction\",\"volume\":\"235 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3497776.3517780\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3497776.3517780","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

加速计算增加了对程序如何根据目标进行并行化的专门化的需求。与多核CPU相比，充分利用高度并行的加速器(如GPU)需要更多的并行性，有时甚至需要更多的并行级别。OpenMP针对每个并行级别都有一个指令，但是为每个目标选择指令可能会导致显著的生产力成本。我们认为，使用新的OpenMP循环指令和适当的编译器决策过程可以实现与特定目标并行化相同的性能优势，并具有针对所有目标的单个指令的生产力优势。在本文中，我们介绍了一个完全描述性的模型，并通过循环指令的实现演示了它的好处，使用SPEC ACCEL基准套件将性能、生产力和可移植性与其他生产编译器进行了比较。我们在NVIDIA的HPC编译器中提供了我们的建议的实现。与gpu上的基准性能(取决于主机系统)相比，它可以产生高达56X的加速，平均速度为1.91x-1.79x，并保持CPU性能。此外，我们的建议需要减少60%的并行指令。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Performant portable OpenMP

Accelerated computing has increased the need to specialize how a program is parallelized depending on the target. Fully exploiting a highly parallel accelerator, such as a GPU, demands more parallelism and sometimes more levels of parallelism than a multicore CPU. OpenMP has a directive for each level of parallelism, but choosing directives for each target can incur a significant productivity cost. We argue that using the new OpenMP loop directive with an appropriate compiler decision process can achieve the same performance benefits of target-specific parallelization with the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. We provide an implementation of our proposal in NVIDIA's HPC compiler. It yields up to 56X speedup and an average of 1.91x-1.79x speedup compared to the baseline performance (depending on the host system) on GPUs, and preserves CPU performance. In addition, our proposal requires 60% fewer parallelism directives.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction

自引率

0.00%

发文量