Gustavo Leite, A. Baldassin, G. Araújo, J. N. Amaral
{"title":"FPGA加速器中编译器优化的性能评估","authors":"Gustavo Leite, A. Baldassin, G. Araújo, J. N. Amaral","doi":"10.5753/wscad.2019.8681","DOIUrl":null,"url":null,"abstract":"With the increasing power wall in microprocessor design, engineers shifted their attention to heterogeneous architectures, wherein several classes of devices are used for computation. Among them are FPGAs which offer comparable performance to CPUs while consuming only a fraction of energy. Despite the increasing interest in these devices, programmability and performance engineering in FPGAs remain hard. This work presents an evaluation of the most prominent code transformations targeting FPGAs. More specifically, it studies the performance effect of unrolling loops, replicating compute units and transferring data using DMA in a matrix multiplication OpenCL kernel through an Intel® FPGA. The results indicate that these optimizations can achieve speedups up to 3.78× for a matrix multiplication application, and 412.5× speedup in data transfer.","PeriodicalId":117711,"journal":{"name":"Anais do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"2007 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance Evaluation of Compiler Optimizations in FPGA Accelerators\",\"authors\":\"Gustavo Leite, A. Baldassin, G. Araújo, J. N. Amaral\",\"doi\":\"10.5753/wscad.2019.8681\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the increasing power wall in microprocessor design, engineers shifted their attention to heterogeneous architectures, wherein several classes of devices are used for computation. Among them are FPGAs which offer comparable performance to CPUs while consuming only a fraction of energy. Despite the increasing interest in these devices, programmability and performance engineering in FPGAs remain hard. This work presents an evaluation of the most prominent code transformations targeting FPGAs. More specifically, it studies the performance effect of unrolling loops, replicating compute units and transferring data using DMA in a matrix multiplication OpenCL kernel through an Intel® FPGA. The results indicate that these optimizations can achieve speedups up to 3.78× for a matrix multiplication application, and 412.5× speedup in data transfer.\",\"PeriodicalId\":117711,\"journal\":{\"name\":\"Anais do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)\",\"volume\":\"2007 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-11-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Anais do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5753/wscad.2019.8681\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Anais do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5753/wscad.2019.8681","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Performance Evaluation of Compiler Optimizations in FPGA Accelerators
With the increasing power wall in microprocessor design, engineers shifted their attention to heterogeneous architectures, wherein several classes of devices are used for computation. Among them are FPGAs which offer comparable performance to CPUs while consuming only a fraction of energy. Despite the increasing interest in these devices, programmability and performance engineering in FPGAs remain hard. This work presents an evaluation of the most prominent code transformations targeting FPGAs. More specifically, it studies the performance effect of unrolling loops, replicating compute units and transferring data using DMA in a matrix multiplication OpenCL kernel through an Intel® FPGA. The results indicate that these optimizations can achieve speedups up to 3.78× for a matrix multiplication application, and 412.5× speedup in data transfer.