An Empirically Optimized Radix Sort for GPU
Bonan Huang, Jinlan Gao, Xiaoming Li
2009 IEEE International Symposium on Parallel and Distributed Processing with Applications
DOI: 10.1109/ISPA.2009.89 (https://doi.org/10.1109/ISPA.2009.89)
Published: 2009-08-18
Citations: 20
Abstract
Graphics Processing Units (GPUs) that support general-purpose programming are promising platforms for high-performance computing. However, the fundamental architectural differences between GPUs and CPUs, the complexity of the GPU platform, and the diversity of GPU specifications have made it increasingly difficult to generate highly efficient GPU code. Manual code generation is time-consuming, and the result tends to be difficult to debug and maintain. On the other hand, the code generated by today's GPU compilers often performs much worse than the best hand-tuned code. A promising code generation strategy, implemented by systems like ATLAS~\cite{Whaley}, FFTW~\cite{FFTW_org}, SPIRAL~\cite{Pueschel:05} and X-Sort~\cite{Li:05}, uses empirical search to find the implementation parameter values, such as tile size and instruction schedule, that deliver near-optimal performance on a particular machine. However, this approach has proved successful only on CPUs, where program performance is comparatively well understood. Clearly, empirical search must be extended to general-purpose programs on GPUs. In this paper, we propose an empirical optimization technique for one of the most important sorting routines on GPUs, the radix sort, that generates highly efficient code for a number of representative NVIDIA GPUs with a wide variety of architectural specifications. Our study focuses on the algorithmic parameters of radix sort that can be adapted to different environments and on the GPU architectural factors that affect its performance. We present an empirical optimization approach that is shown to find highly efficient code for different NVIDIA GPUs. Our results show that this approach is effective at taking into account the complex interactions between architectural characteristics, and that the resulting code significantly outperforms two radix sort implementations that have themselves been shown to outperform other GPU sort routines, with a maximum speedup of 33.4%.
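To make the idea of empirical search concrete, the following is a minimal CPU-side sketch in Python, not the authors' GPU implementation. It assumes a single tunable parameter, the number of key bits consumed per pass of a least-significant-digit radix sort; the abstract does not list the parameters the paper actually tunes. The names `radix_sort`, `empirical_search`, and `bits_per_pass` are hypothetical and chosen only for illustration.

```python
import random
import time


def radix_sort(keys, bits_per_pass, key_bits=32):
    """LSD radix sort of non-negative integers below 2**key_bits.

    bits_per_pass is the (assumed) tunable parameter: more bits mean
    fewer passes but a larger histogram per pass.
    """
    keys = list(keys)  # work on a copy so the input is left untouched
    num_buckets = 1 << bits_per_pass
    mask = num_buckets - 1
    for shift in range(0, key_bits, bits_per_pass):
        # Histogram of the current digit.
        counts = [0] * num_buckets
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum: counts[d] becomes the start offset of digit d.
        total = 0
        for d in range(num_buckets):
            counts[d], total = total, total + counts[d]
        # Stable scatter into a fresh output buffer.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[counts[d]] = k
            counts[d] += 1
        keys = out
    return keys


def empirical_search(candidates, n=20000, trials=3, seed=0):
    """Time every candidate value on this machine and return the fastest.

    Returns (best_bits, timings), where timings maps each candidate to
    its best observed wall-clock time in seconds.
    """
    rng = random.Random(seed)
    data = [rng.getrandbits(32) for _ in range(n)]
    expected = sorted(data)
    timings = {}
    for bits in candidates:
        best = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            result = radix_sort(data, bits)
            best = min(best, time.perf_counter() - start)
        # Reject any variant that does not sort correctly.
        assert result == expected
        timings[bits] = best
    best_bits = min(timings, key=timings.get)
    return best_bits, timings
```

Calling `empirical_search([2, 4, 8, 11, 16])` returns whichever candidate ran fastest on the current machine. The winner may differ between machines and between runs, which is the reason such a search measures rather than predicts. A GPU version would presumably search over more parameters and time actual kernels, but that is beyond what the abstract specifies.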