Bridging Architecture and Programming for Throughput-Oriented Vision Processing (Abstract Only)

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2015-02-22. DOI: 10.1145/2684746.2689140

Amir Momeni, H. Tabkhi, G. Schirner, D. Kaeli

Abstract

With the expansion of OpenCL support across many heterogeneous devices (including FPGAs, GPUs and CPUs), the programmability of these systems has increased significantly. At the same time, new questions arise about which device should be targeted for each OpenCL software kernel. Once we select a device, we are left to customize the application, selecting the right granularity of parallelism and the frequency of host-to-device communication. In this paper, we study the impact of source-level decisions on the overall execution time when developing OpenCL programs across different heterogeneous devices. We focus on two mainstream architecture classes (GPUs and FPGAs), and consider throughput-oriented advanced vision processing. To guide this exploration, we propose a new vertical classification for selecting the grain of parallelism for advanced vision processing applications. To carry out this study, we have selected the Mean-shift object tracking algorithm as a representative candidate of advanced vision algorithms. Overall, our evaluation demonstrates that fine-grained parallelism can greatly benefit FPGA execution (up to a 4X speed-up), while a combination of coarse-grained and fine-grained parallelism achieves the best performance on a GPU (up to a 6X speed-up). Also, there can be a large benefit if we can execute both the parallel and serial parts of the program on an FPGA (up to a 21X speed-up).
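
To make the coarse-grained versus fine-grained distinction concrete, the following is a minimal sketch in plain Python rather than OpenCL. It is not the authors' implementation and the abstract does not describe their kernels. It assumes a histogram-building step over each tracked window, which Mean-shift trackers commonly use, and every name in it (pixel_bin, histogram_coarse, histogram_fine) is hypothetical. The loops marked "parallel" stand in for what would be independent work-items on a device.

```python
# Illustrative only: plain Python standing in for OpenCL work-items.
# All function names are hypothetical and do not come from the paper.

def pixel_bin(value, num_bins=16, max_value=256):
    """Map a grayscale pixel value to a histogram bin index."""
    return value * num_bins // max_value


def histogram_coarse(windows, num_bins=16):
    """Coarse grain: one unit of work per tracked window.

    Each iteration of the outer loop is independent of the others, so it
    could map to one work-item (or work-group) per target.
    """
    results = []
    for window in windows:              # parallel across targets
        hist = [0] * num_bins
        for row in window:              # serial inside each target
            for value in row:
                hist[pixel_bin(value, num_bins)] += 1
        results.append(hist)
    return results


def histogram_fine(window, num_bins=16):
    """Fine grain: one unit of work per pixel, followed by a reduction.

    Each pixel independently yields a bin index. Combining the per-pixel
    results is a reduction, which on a real device would typically need
    atomics or a tree reduction.
    """
    bin_ids = [pixel_bin(v, num_bins) for row in window for v in row]  # parallel per pixel
    hist = [0] * num_bins
    for b in bin_ids:                   # reduction step
        hist[b] += 1
    return hist


# Two tiny 2x3 "windows" of grayscale pixels, made up for the example.
windows = [
    [[0, 16, 32], [255, 255, 128]],
    [[10, 20, 30], [40, 50, 60]],
]
coarse = histogram_coarse(windows)
fine = [histogram_fine(w) for w in windows]
assert coarse == fine  # both decompositions compute the same histograms
```

Both functions compute the same result; they differ only in where the independent work lies. A combined scheme, of the kind the abstract reports as best on a GPU, would apply the per-pixel decomposition inside each per-window unit of work.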
