{"title":"High performance implementation of the TFT","authors":"Lingchuan Meng, Jeremy R. Johnson","doi":"10.1145/2608628.2608661","DOIUrl":null,"url":null,"PeriodicalId":243282,"journal":{"name":"International Symposium on Symbolic and Algebraic Computation","volume":"82 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Symposium on Symbolic and Algebraic Computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2608628.2608661","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
This paper reports on a high-performance implementation of the truncated Fourier transform (TFT). A general Cooley-Tukey-like algorithm for the TFT is developed that allows the implementation to adapt automatically to the memory hierarchy. For larger transform sizes, the algorithm introduces a small relaxation that trades slightly higher arithmetic cost for improved data flow, which allows full vectorization and parallelization. The implementation is automatically derived and tuned using the SPIRAL system for code generation and adaptation. The resulting arbitrary-size TFT library smooths out the staircase performance associated with power-of-two modular FFT implementations while retaining the performance of state-of-the-art FFT libraries. This provides a significant performance improvement over approaches that pad to the next power of two, even when those approaches use high-performance FFT libraries.
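
The "staircase" mentioned above comes from padding: a power-of-two FFT applied to an input of length n must work on the next power of two, so the amount of work jumps at each power of two and stays flat in between. The following is a minimal sketch of that size arithmetic only. It is not the paper's algorithm or the SPIRAL-generated code, and the helper names `next_power_of_two` and `padded_fraction` are hypothetical, introduced here for illustration.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (the size a padding approach would use)."""
    if n < 1:
        raise ValueError("transform size must be a positive integer")
    return 1 << (n - 1).bit_length()


def padded_fraction(n: int) -> float:
    """Fraction of the padded transform that consists of zero padding."""
    m = next_power_of_two(n)
    return (m - n) / m


if __name__ == "__main__":
    # Sizes just below, at, and just above a power of two.
    for n in (1000, 1024, 1025, 1536, 2048):
        m = next_power_of_two(n)
        print(f"n={n:5d}  padded size={m:5d}  zero padding={padded_fraction(n):.1%}")
```

Just past a power of two (for example n = 1025) nearly half of the padded transform is zeros. This is the overhead that an arbitrary-size TFT is meant to avoid.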