{"title":"GPU和fpga上的三维递归高斯IIR -一个加速带宽限制应用的案例","authors":"J. Cong, Muhuan Huang, Yi Zou","doi":"10.1109/SASP.2011.5941081","DOIUrl":null,"url":null,"abstract":"GPU device typically has a higher off-chip bandwidth than FPGA-based systems. Thus typically GPU should perform better for bandwidth-bounded massive parallel applications. In this paper, we present our implementations of a 3D recursive Gaussian IIR on multi-core CPU, many-core GPU and multi-FPGA platforms. Our baseline implementation on the CPU features the smallest arithmetic computation (2 MADDs per dimension). While this application is clearly bandwidth bounded, the difference on the memory subsystems translates to different bandwidth optimization techniques. Our implementations on the GPU and FPGA platforms show 26X and 33X speedup respectively over optimized single-thread code on CPU.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"35 6","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"3D recursive Gaussian IIR on GPU and FPGAs — A case for accelerating bandwidth-bounded applications\",\"authors\":\"J. Cong, Muhuan Huang, Yi Zou\",\"doi\":\"10.1109/SASP.2011.5941081\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"GPU device typically has a higher off-chip bandwidth than FPGA-based systems. Thus typically GPU should perform better for bandwidth-bounded massive parallel applications. In this paper, we present our implementations of a 3D recursive Gaussian IIR on multi-core CPU, many-core GPU and multi-FPGA platforms. Our baseline implementation on the CPU features the smallest arithmetic computation (2 MADDs per dimension). While this application is clearly bandwidth bounded, the difference on the memory subsystems translates to different bandwidth optimization techniques. Our implementations on the GPU and FPGA platforms show 26X and 33X speedup respectively over optimized single-thread code on CPU.\",\"PeriodicalId\":375788,\"journal\":{\"name\":\"2011 IEEE 9th Symposium on Application Specific Processors (SASP)\",\"volume\":\"35 6\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE 9th Symposium on Application Specific Processors (SASP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SASP.2011.5941081\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SASP.2011.5941081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
3D recursive Gaussian IIR on GPU and FPGAs — A case for accelerating bandwidth-bounded applications
GPU device typically has a higher off-chip bandwidth than FPGA-based systems. Thus typically GPU should perform better for bandwidth-bounded massive parallel applications. In this paper, we present our implementations of a 3D recursive Gaussian IIR on multi-core CPU, many-core GPU and multi-FPGA platforms. Our baseline implementation on the CPU features the smallest arithmetic computation (2 MADDs per dimension). While this application is clearly bandwidth bounded, the difference on the memory subsystems translates to different bandwidth optimization techniques. Our implementations on the GPU and FPGA platforms show 26X and 33X speedup respectively over optimized single-thread code on CPU.