Anumeena Sorna, Xiao-he Cheng, E. D'Azevedo, Kwai Wong, S. Tomov
{"title":"基于张量核心硬件的混合精度快速傅里叶变换优化","authors":"Anumeena Sorna, Xiao-he Cheng, E. D'Azevedo, Kwai Wong, S. Tomov","doi":"10.1109/HIPCW.2018.8634417","DOIUrl":null,"url":null,"abstract":"The Fast Fourier Transform is a fundamental tool in scientific and technical computation. The highly parallelizable nature of the algorithm makes it a suitable candidate for GPU acceleration. This paper focuses on exploiting the speedup due to using the half precision multiplication capability of the latest GPUs' tensor core hardware without significantly degrading the precision of the Fourier Transform result. We develop an algorithm that dynamically splits the input single precision dataset into two half precision sets at the lowest level, uses half precision multiplication, and recombines the result at a later step. This work paves the way for using tensor cores for high precision inputs.","PeriodicalId":401060,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW)","volume":"67 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":"{\"title\":\"Optimizing the Fast Fourier Transform Using Mixed Precision on Tensor Core Hardware\",\"authors\":\"Anumeena Sorna, Xiao-he Cheng, E. D'Azevedo, Kwai Wong, S. Tomov\",\"doi\":\"10.1109/HIPCW.2018.8634417\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Fast Fourier Transform is a fundamental tool in scientific and technical computation. The highly parallelizable nature of the algorithm makes it a suitable candidate for GPU acceleration. This paper focuses on exploiting the speedup due to using the half precision multiplication capability of the latest GPUs' tensor core hardware without significantly degrading the precision of the Fourier Transform result. We develop an algorithm that dynamically splits the input single precision dataset into two half precision sets at the lowest level, uses half precision multiplication, and recombines the result at a later step. This work paves the way for using tensor cores for high precision inputs.\",\"PeriodicalId\":401060,\"journal\":{\"name\":\"2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW)\",\"volume\":\"67 1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"36\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HIPCW.2018.8634417\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HIPCW.2018.8634417","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Optimizing the Fast Fourier Transform Using Mixed Precision on Tensor Core Hardware
The Fast Fourier Transform is a fundamental tool in scientific and technical computation. The highly parallelizable nature of the algorithm makes it a suitable candidate for GPU acceleration. This paper focuses on exploiting the speedup due to using the half precision multiplication capability of the latest GPUs' tensor core hardware without significantly degrading the precision of the Fourier Transform result. We develop an algorithm that dynamically splits the input single precision dataset into two half precision sets at the lowest level, uses half precision multiplication, and recombines the result at a later step. This work paves the way for using tensor cores for high precision inputs.