A fast integral image generation algorithm on GPUs

Qingqing Dang, Shengen Yan, Ren Wu
Published in: 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)
Publication date: 2014-12-01
DOI: 10.1109/PADSW.2014.7097862 (https://doi.org/10.1109/PADSW.2014.7097862)
Citations: 3

Abstract

An integral image, also known as a summed-area table, is a two-dimensional table generated from an input image. Each entry in the table stores the sum of all input pixels that lie above and to the left of that entry's position. Integral images are widely used in computer vision and computer graphics; in real-time computer vision in particular, they are typically used to accelerate computing the sum over a rectangular area. Integral image generation is memory-bound. Two typical integral image algorithms exist on GPUs. The first is the Scan-Scan algorithm. The second is the Scan-Transpose-Scan algorithm, which generates the integral image in three steps: the first and third steps are scans, and a transpose step is added so that the third step achieves coalesced global memory access. In this paper, we propose a novel blocked integral image algorithm with three stages: intra-block reduction, auxiliary matrix scan, and intra-block scan. Compared with the Scan-Scan algorithm, the proposed scheme reduces global memory accesses while requiring fewer local synchronizations and incurring less load imbalance. Compared with the Scan-Transpose-Scan algorithm, it needs only about half as many global memory accesses while still achieving coalesced memory access. We implemented all three algorithms in OpenCL so that they run on both Nvidia and AMD GPUs, and designed an auto-tuning framework that searches for optimal parameters for different input matrix sizes on the two platforms. Experimental results show that the proposed algorithm outperforms both existing algorithms.
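The abstract names the three stages but gives no kernel details, so the following is only a minimal sequential sketch of how a blocked scheme of this kind might fit together. It is written in plain Python rather than OpenCL, and it is not the authors' implementation. The function names, the `block` parameter, the auxiliary arrays (`col_sums`, `row_sums`, `totals`, `above`, `left`, `corner`), and the choice of an inclusive summed-area table are all assumptions made for illustration.

```python
def integral_image_reference(img):
    """Inclusive summed-area table: out[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            out[y][x] = row_sum + (out[y - 1][x] if y > 0 else 0)
    return out


def rect_sum(sat, top, left, bottom, right):
    """Sum of rows top..bottom, columns left..right (inclusive) from a table."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


def blocked_integral_image(img, block=4):
    """Hypothetical three-stage blocked scheme (sequential illustration)."""
    h, w = len(img), len(img[0])
    nby = (h + block - 1) // block  # number of block rows
    nbx = (w + block - 1) // block  # number of block columns

    # Stage 1 (assumed form of "intra-block reduction"): each block reduces
    # its pixels to per-column sums, per-row sums, and a block total.
    col_sums = [[0] * w for _ in range(nby)]  # col_sums[by][x]
    row_sums = [[0] * nbx for _ in range(h)]  # row_sums[y][bx]
    totals = [[0] * nbx for _ in range(nby)]  # totals[by][bx]
    for y in range(h):
        for x in range(w):
            v = img[y][x]
            col_sums[y // block][x] += v
            row_sums[y][x // block] += v
            totals[y // block][x // block] += v

    # Stage 2 (assumed form of "auxiliary matrix scan"): exclusive scans
    # across blocks give each block what lies above, left, and diagonally.
    above = [[0] * w for _ in range(nby)]
    for by in range(1, nby):
        for x in range(w):
            above[by][x] = above[by - 1][x] + col_sums[by - 1][x]
    left = [[0] * nbx for _ in range(h)]
    for y in range(h):
        for bx in range(1, nbx):
            left[y][bx] = left[y][bx - 1] + row_sums[y][bx - 1]
    corner = [[0] * nbx for _ in range(nby)]
    for by in range(1, nby):
        for bx in range(1, nbx):
            corner[by][bx] = (corner[by - 1][bx] + corner[by][bx - 1]
                              - corner[by - 1][bx - 1] + totals[by - 1][bx - 1])

    # Stage 3 (assumed form of "intra-block scan"): seed the first row and
    # first column of each tile with the offsets, then scan the tile locally.
    out = [[0] * w for _ in range(h)]
    for by in range(nby):
        for bx in range(nbx):
            y0, x0 = by * block, bx * block
            for y in range(y0, min(y0 + block, h)):
                run = 0  # running sum along this row of the tile
                for x in range(x0, min(x0 + block, w)):
                    v = img[y][x]
                    if y == y0:
                        v += above[by][x]
                    if x == x0:
                        v += left[y][bx]
                    if y == y0 and x == x0:
                        v += corner[by][bx]
                    run += v
                    out[y][x] = run + (out[y - 1][x] if y > y0 else 0)
    return out
```

In this sketch every tile in the third stage depends only on its own pixels and the small auxiliary arrays, which is the property that would let blocks be processed independently on a GPU. `rect_sum` shows why the table is useful: once it exists, the sum over any rectangle takes at most four lookups regardless of the rectangle's size.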