Patrik Goorts, S. Rogmans, Steven Vanden Eynde, P. Bekaert
{"title":"GPU计算优化原理的实例","authors":"Patrik Goorts, S. Rogmans, Steven Vanden Eynde, P. Bekaert","doi":"10.5220/0002990400460049","DOIUrl":null,"url":null,"abstract":"In this paper, we provide examples to optimize signal processing or visual computing algorithms written for SIMT-based GPU architectures. These implementations demonstrate the optimizations for CUDA or its successors OpenCL and DirectCompute. We discuss the effect and optimization principles of memory coalescing, bandwidth reduction, processor occupancy, bank conflict reduction, local memory elimination and instruction optimization. The effect of the optimization steps are illustrated by state-of-the-art examples. A comparison with optimized and unoptimized algorithms is provided. A first example discusses the construction of joint histograms using shared memory, where optimizations lead to a significant speedup compared to the original implementation. A second example presents convolution and the acquired results.","PeriodicalId":408116,"journal":{"name":"2010 International Conference on Signal Processing and Multimedia Applications (SIGMAP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Practical examples of GPU computing optimization principles\",\"authors\":\"Patrik Goorts, S. Rogmans, Steven Vanden Eynde, P. Bekaert\",\"doi\":\"10.5220/0002990400460049\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we provide examples to optimize signal processing or visual computing algorithms written for SIMT-based GPU architectures. These implementations demonstrate the optimizations for CUDA or its successors OpenCL and DirectCompute. We discuss the effect and optimization principles of memory coalescing, bandwidth reduction, processor occupancy, bank conflict reduction, local memory elimination and instruction optimization. The effect of the optimization steps are illustrated by state-of-the-art examples. A comparison with optimized and unoptimized algorithms is provided. A first example discusses the construction of joint histograms using shared memory, where optimizations lead to a significant speedup compared to the original implementation. A second example presents convolution and the acquired results.\",\"PeriodicalId\":408116,\"journal\":{\"name\":\"2010 International Conference on Signal Processing and Multimedia Applications (SIGMAP)\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-07-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2010 International Conference on Signal Processing and Multimedia Applications (SIGMAP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5220/0002990400460049\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 International Conference on Signal Processing and Multimedia Applications (SIGMAP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5220/0002990400460049","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Practical examples of GPU computing optimization principles
In this paper, we provide examples to optimize signal processing or visual computing algorithms written for SIMT-based GPU architectures. These implementations demonstrate the optimizations for CUDA or its successors OpenCL and DirectCompute. We discuss the effect and optimization principles of memory coalescing, bandwidth reduction, processor occupancy, bank conflict reduction, local memory elimination and instruction optimization. The effect of the optimization steps are illustrated by state-of-the-art examples. A comparison with optimized and unoptimized algorithms is provided. A first example discusses the construction of joint histograms using shared memory, where optimizations lead to a significant speedup compared to the original implementation. A second example presents convolution and the acquired results.