Compiling SIMT Programs on Multi- and Many-Core Processors with Wide Vector Units: A Case Study with CUDA

Hancheng Wu, J. Ravi, M. Becchi
2018 IEEE 25th International Conference on High Performance Computing (HiPC), December 2018
DOI: 10.1109/HiPC.2018.00022
Manycore processors and coprocessors with wide vector extensions, such as Intel Xeon Phi and Skylake devices, have become popular due to their high throughput capability. Performance optimization on these devices requires using both their x86-compatible cores and their vector units. While the x86-compatible cores can be programmed using traditional interfaces following the MIMD model, such as POSIX threads, MPI and OpenMP, the SIMD vector units are harder to program. The Intel software stack provides two approaches to code vectorization: automatic vectorization through the Intel compiler and manual vectorization through vector intrinsics. While the Intel compiler often fails to vectorize code with complex control flow and function calls, the manual approach is error-prone and leads to less portable code. Hence, there has been increasing interest in SIMT programming tools that allow the simultaneous use of x86 cores and vector units while providing programmability and code portability. However, the effective implementation of the SIMT model on these hybrid architectures is not well understood. In this work, we target this problem. First, we propose a set of compiler techniques to transform programs written in a SIMT programming model (a subset of CUDA C) into code that leverages both the x86 cores and the vector units of a hybrid MIMD/SIMD architecture, thus providing programmability, high system utilization and performance. Second, we evaluate the proposed techniques on Xeon Phi and Skylake processors using micro-benchmarks and real-world applications. Third, we compare the resulting performance with that achieved by the same code on GPUs. Based on this analysis, we point out the main challenges in supporting the SIMT model on hybrid MIMD/SIMD architectures while providing performance comparable to that of SIMT systems (e.g., GPUs).