Performance Comparison of CGRA and Mobile GPU for Light-Field Image Processing

2016 Fourth International Symposium on Computing and Networking (CANDAR). Pub Date: 2016-11-01. DOI: 10.1109/CANDAR.2016.0040

Yuttakon Yuttakonkit, Y. Nakashima

Citations: 4

Abstract

Recently, many approaches have applied light-field image processing on smartphones and wearable devices. A Graphics Processing Unit (GPU) is commonly used to exploit the parallelism in such image processing. However, the access pattern in light-field applications is sparser than in typical stencil applications and does not use all the data in a cache line. Furthermore, data requests to multiple locations generate an enormous number of short-burst memory transfers in the cache system, incur high latency, and do not fully utilize the GPU's high memory bandwidth. Therefore, an alternative architecture that exploits long-burst data transfers to improve memory bandwidth utilization is essential. We propose a sparse-stencil-oriented Coarse Grain Reconfigurable Accelerator (CGRA) that we call EMAXV. Unlike the on-demand loading of multiple data items on a GPU, EMAXV loads the input data with long-burst transfers before execution proceeds, which hides the sparse memory accesses and the cache contention among threads. It further hides the memory loading latency by overlapping it with the execution latency of different activations. We evaluated the performance of EMAXV and a mobile GPU (Tegra K1) with identical host CPU frequency and main memory bandwidth. Although EMAXV has much lower computational capability, it achieved four times the performance of the mobile GPU for light-field depth extraction and 89% of the mobile GPU's performance for light-field image rendering.
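The abstract's claim that a sparse stencil "does not use all data in a cache line" can be illustrated with a toy calculation. The sketch below is not taken from the paper: the 64-byte cache line, 4-byte elements, 1024-element row width, the 3x3 tap layout, and the function names are all assumptions chosen for illustration. It ignores any reuse of a line by neighboring threads and only measures, for a single stencil evaluation, what fraction of the fetched bytes is actually read.

```python
def stencil_offsets(taps, stride, row_width):
    """Element offsets of a taps x taps stencil whose taps are `stride` apart.

    All parameters are hypothetical; the paper's actual stencils are not given here.
    """
    return [r * stride * row_width + c * stride
            for r in range(taps) for c in range(taps)]


def cache_line_utilization(offsets, elem_bytes=4, line_bytes=64):
    """Fraction of fetched cache-line bytes that the stencil actually reads."""
    byte_addrs = {off * elem_bytes for off in offsets}
    lines = {addr // line_bytes for addr in byte_addrs}
    return len(byte_addrs) * elem_bytes / (len(lines) * line_bytes)


# Dense 3x3 stencil: the three taps of each row share one cache line.
dense = cache_line_utilization(stencil_offsets(3, 1, 1024))
# Sparse 3x3 stencil with taps 64 elements apart: every tap hits its own line.
sparse = cache_line_utilization(stencil_offsets(3, 64, 1024))

print(dense, sparse)  # 0.1875 0.0625
```

Under these assumed parameters the sparse pattern touches nine lines instead of three for the same nine values, which is the kind of short, poorly utilized transfer the abstract argues a long-burst preload is meant to avoid.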
