OpenCL的cpu性能陷阱

2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing Pub Date : 2013-02-27 DOI:10.1109/PDP.2013.16

Jie Shen, Jianbin Fang, H. Sips, A. Varbanescu

{"title":"OpenCL的cpu性能陷阱","authors":"Jie Shen, Jianbin Fang, H. Sips, A. Varbanescu","doi":"10.1109/PDP.2013.16","DOIUrl":null,"url":null,"abstract":"With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs, or simply writing new code for CPUs, using OpenCL brings up the performance issue, usually raised in one of two forms: \"OpenCL is not performance portable!\" or \"Why using OpenCL for CPUs after all?!\". We argue that both issues can be addressed by a thorough study of the factors that impact the performance of OpenCL on CPUs. This analysis is the focus of this paper. Specifically, starting from the two main architectural mismatches between many-core CPUs and the OpenCL platform-parallelism granularity and the memory model-we identify eight such performance \"traps\" that lead to performance degradation in OpenCL for CPUs. Using multiple code examples, from both synthetic and real-life benchmarks, we quantify the impact of these traps, showing how avoiding them can give up to 10 times better performance. Furthermore, we point out that the solutions we provide for avoiding these traps are simple and generic code transformations, which can be easily adopted by either programmers or automated tools. Therefore, we conclude that a certain degree of OpenCL inter-platform performance portability, while indeed not a given, can be achieved by simple and generic code transformations.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"42","resultStr":"{\"title\":\"Performance Traps in OpenCL for CPUs\",\"authors\":\"Jie Shen, Jianbin Fang, H. Sips, A. Varbanescu\",\"doi\":\"10.1109/PDP.2013.16\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs, or simply writing new code for CPUs, using OpenCL brings up the performance issue, usually raised in one of two forms: \\\"OpenCL is not performance portable!\\\" or \\\"Why using OpenCL for CPUs after all?!\\\". We argue that both issues can be addressed by a thorough study of the factors that impact the performance of OpenCL on CPUs. This analysis is the focus of this paper. Specifically, starting from the two main architectural mismatches between many-core CPUs and the OpenCL platform-parallelism granularity and the memory model-we identify eight such performance \\\"traps\\\" that lead to performance degradation in OpenCL for CPUs. Using multiple code examples, from both synthetic and real-life benchmarks, we quantify the impact of these traps, showing how avoiding them can give up to 10 times better performance. Furthermore, we point out that the solutions we provide for avoiding these traps are simple and generic code transformations, which can be easily adopted by either programmers or automated tools. Therefore, we conclude that a certain degree of OpenCL inter-platform performance portability, while indeed not a given, can be achieved by simple and generic code transformations.\",\"PeriodicalId\":202977,\"journal\":{\"name\":\"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-02-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"42\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PDP.2013.16\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDP.2013.16","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 42

摘要

由于其跨平台可移植性的设计理念，OpenCL不仅可以用于gpu(它非常受欢迎)，还可以用于cpu。无论是将GPU程序移植到cpu，还是简单地为cpu编写新代码，使用OpenCL都会带来性能问题，通常以两种形式之一提出:“OpenCL不能在性能上移植!”或“为什么要在cpu上使用OpenCL ?!”我们认为，这两个问题都可以通过深入研究影响cpu上OpenCL性能的因素来解决。这一分析是本文的重点。具体来说，从多核cpu和OpenCL平台之间的两个主要架构不匹配——并行度粒度和内存模型——开始，我们确定了8个导致OpenCL cpu性能下降的性能“陷阱”。使用多个代码示例，包括合成的和实际的基准测试，我们量化了这些陷阱的影响，展示了如何避免它们可以使性能提高10倍。此外，我们指出，我们为避免这些陷阱提供的解决方案是简单而通用的代码转换，它可以很容易地被程序员或自动化工具所采用。因此，我们得出结论，一定程度的OpenCL跨平台性能可移植性，虽然确实不是给定的，但可以通过简单和通用的代码转换来实现。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Performance Traps in OpenCL for CPUs

With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs, or simply writing new code for CPUs, using OpenCL brings up the performance issue, usually raised in one of two forms: "OpenCL is not performance portable!" or "Why using OpenCL for CPUs after all?!". We argue that both issues can be addressed by a thorough study of the factors that impact the performance of OpenCL on CPUs. This analysis is the focus of this paper. Specifically, starting from the two main architectural mismatches between many-core CPUs and the OpenCL platform-parallelism granularity and the memory model-we identify eight such performance "traps" that lead to performance degradation in OpenCL for CPUs. Using multiple code examples, from both synthetic and real-life benchmarks, we quantify the impact of these traps, showing how avoiding them can give up to 10 times better performance. Furthermore, we point out that the solutions we provide for avoiding these traps are simple and generic code transformations, which can be easily adopted by either programmers or automated tools. Therefore, we conclude that a certain degree of OpenCL inter-platform performance portability, while indeed not a given, can be achieved by simple and generic code transformations.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

自引率

0.00%

发文量