{"title":"Bridging Architecture and Programming for Throughput-Oriented Vision Processing (Abstract Only)","authors":"Amir Momeni, H. Tabkhi, G. Schirner, D. Kaeli","doi":"10.1145/2684746.2689140","DOIUrl":null,"url":null,"abstract":"With the expansion of OpenCL support across many heterogeneous devices (including FPGAs, GPUs and CPUs), the programmability of these systems has been significantly increased. At the same time, new questions arise about which device should be targeted for each OpenCL software kernel. Once we select a device, then we are left to customize the application, selecting the right granularity of parallelism and frequency of host-to-device communication. In this paper, we study the impact of source-level decisions on the overall execution time when developing OpenCL program across different heterogeneous devices. We focus on two mainstream architecture classes (GPUs and FPGAs), and consider throughput-oriented advanced vision processing. To guide this exploration, we propose a new vertical classification for selecting the grain of parallelism for advanced vision processing applications. To carry out this study we have selected the Mean-shift object tracking algorithm as a representative candidate of advanced vision algorithms. Overall, our evaluation demonstrates that fine-grained parallelism can greatly benefit FPGA execution (up to a 4X speed-up), while a combination of coarse-grained and fine-grained parallelism achieves the best performance on a GPU (up to a 6X speed-up). Also, there can be a large benefit if we can execute both the parallel and serial parts of the program on a FPGA (up to a 21X speed-up).","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2684746.2689140","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
With the expansion of OpenCL support across many heterogeneous devices (including FPGAs, GPUs and CPUs), the programmability of these systems has been significantly increased. At the same time, new questions arise about which device should be targeted for each OpenCL software kernel. Once we select a device, then we are left to customize the application, selecting the right granularity of parallelism and frequency of host-to-device communication. In this paper, we study the impact of source-level decisions on the overall execution time when developing OpenCL program across different heterogeneous devices. We focus on two mainstream architecture classes (GPUs and FPGAs), and consider throughput-oriented advanced vision processing. To guide this exploration, we propose a new vertical classification for selecting the grain of parallelism for advanced vision processing applications. To carry out this study we have selected the Mean-shift object tracking algorithm as a representative candidate of advanced vision algorithms. Overall, our evaluation demonstrates that fine-grained parallelism can greatly benefit FPGA execution (up to a 4X speed-up), while a combination of coarse-grained and fine-grained parallelism achieves the best performance on a GPU (up to a 6X speed-up). Also, there can be a large benefit if we can execute both the parallel and serial parts of the program on a FPGA (up to a 21X speed-up).