A fast integral image generation algorithm on GPUs
Qingqing Dang, Shengen Yan, Ren Wu
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2014-12-01
DOI: https://doi.org/10.1109/PADSW.2014.7097862
An integral image, also known as a summed-area table, is a two-dimensional table generated from an input image. Each entry in the table stores the sum of all pixels above and to the left of the corresponding position in the input image. The integral image is widely used in computer vision and computer graphics applications; in real-time computer vision in particular, it accelerates computing the sum over a rectangular area. Integral image generation is memory-bound. There are two typical existing integral image algorithms on GPUs. The first is the Scan-Scan algorithm. The second is the Scan-Transpose-Scan algorithm, which generates the integral image in three steps: the first and third steps are scans, and a transpose step is added between them so that the third step achieves coalesced global memory access. In this paper, we propose a novel blocked integral image algorithm with three stages: intra-block reduction, auxiliary matrix scan, and intra-block scan. Compared with the Scan-Scan algorithm, our scheme reduces global memory accesses, requires fewer local synchronizations, and suffers less load imbalance. Compared with the Scan-Transpose-Scan algorithm, it needs only about half of the global memory accesses while still achieving coalesced memory access. We implemented all three algorithms in OpenCL so that they run on both Nvidia and AMD GPUs. We also designed an auto-tuning framework that searches for optimal parameters for different input matrix sizes on these two platforms. The experimental results show that our proposed algorithm achieves the best performance of the three.
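To make the definition concrete, here is a minimal sequential sketch in Python of an integral image and the rectangle-sum query it accelerates. It is not the GPU implementation described in the abstract. The function names `integral_image` and `rect_sum` are hypothetical, and the sketch assumes the inclusive convention, in which an entry also counts the pixel at its own position.

```python
def integral_image(img):
    """Inclusive summed-area table: sat[y][x] = sum of img[0..y][0..x].

    Assumption: the entry includes its own pixel (inclusive convention).
    """
    height = len(img)
    width = len(img[0]) if height else 0
    sat = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of the current row (horizontal scan)
        for x in range(width):
            row_sum += img[y][x]
            above = sat[y - 1][x] if y > 0 else 0  # vertical carry
            sat[y][x] = row_sum + above
    return sat


def rect_sum(sat, top, left, bottom, right):
    """Sum of img[top..bottom][left..right] (inclusive) from four lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]  # added back: it was subtracted twice
    return total


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(image)
print(table)
print(rect_sum(table, 1, 1, 2, 2))
```

Once the table is built, any rectangle sum costs at most four table reads regardless of the rectangle's size. Here the query over the lower-right 2x2 region returns 5 + 6 + 8 + 9 = 28.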
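The three stages named in the abstract (intra-block reduction, auxiliary matrix scan, intra-block scan) can be illustrated with a sequential Python emulation. This is only a sketch under stated assumptions: the abstract does not specify the layout of the auxiliary matrices, so the per-tile column sums and row sums used here (`col_sums`, `row_sums`, `above`, `left`) are one plausible decomposition, not necessarily the authors' scheme. The function names and the `block` parameter are hypothetical, and no GPU parallelism, local memory, or synchronization is modelled.

```python
def reference_integral_image(img):
    """Straightforward inclusive summed-area table, used only for checking."""
    height = len(img)
    width = len(img[0]) if height else 0
    sat = [[0] * width for _ in range(height)]
    for y in range(height):
        running = 0
        for x in range(width):
            running += img[y][x]
            sat[y][x] = running + (sat[y - 1][x] if y > 0 else 0)
    return sat


def blocked_integral_image(img, block=2):
    """Three-stage blocked sketch; tiles are block x block (assumed layout)."""
    height = len(img)
    width = len(img[0]) if height else 0
    tiles_y = (height + block - 1) // block
    tiles_x = (width + block - 1) // block

    # Stage 1: intra-block reduction. Each tile reduces its pixels to
    # per-column sums and per-row sums (hypothetical auxiliary matrices).
    col_sums = [[0] * width for _ in range(tiles_y)]
    row_sums = [[0] * tiles_x for _ in range(height)]
    for y in range(height):
        for x in range(width):
            col_sums[y // block][x] += img[y][x]
            row_sums[y][x // block] += img[y][x]

    # Stage 2: auxiliary matrix scan.
    # above[by][x] = sum of all pixels in rows above tile row `by`, columns 0..x.
    above = [[0] * width for _ in range(tiles_y)]
    for by in range(1, tiles_y):
        running = 0
        for x in range(width):
            running += col_sums[by - 1][x]
            above[by][x] = above[by - 1][x] + running
    # left[y][bx] = sum of row y over all columns left of tile column `bx`.
    left = [[0] * tiles_x for _ in range(height)]
    for y in range(height):
        for bx in range(1, tiles_x):
            left[y][bx] = left[y][bx - 1] + row_sums[y][bx - 1]

    # Stage 3: intra-block scan. Each tile scans its own pixels, seeded with
    # the carries from stage 2, so it needs no data from other tiles.
    sat = [[0] * width for _ in range(height)]
    for by in range(tiles_y):
        for bx in range(tiles_x):
            y0, x0 = by * block, bx * block
            for y in range(y0, min(y0 + block, height)):
                running = left[y][bx]  # row prefix from tiles to the left
                for x in range(x0, min(x0 + block, width)):
                    running += img[y][x]
                    carry = sat[y - 1][x] if y > y0 else above[by][x]
                    sat[y][x] = running + carry
    return sat


sample = [[(3 * y + 5 * x) % 7 for x in range(5)] for y in range(7)]
assert blocked_integral_image(sample, block=2) == reference_integral_image(sample)
```

In this decomposition the input image is read once in stage 1 and once in stage 3, and the auxiliary matrices are much smaller than the image. That is consistent with the abstract's claim of reduced global memory traffic, although the actual access counts depend on the authors' kernel design.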