Accelerating Deep Learning Tasks with Optimized GPU-assisted Image Decoding

Lipeng Wang, Qiong Luo, Shengen Yan
{"title":"Accelerating Deep Learning Tasks with Optimized GPU-assisted Image Decoding","authors":"Lipeng Wang, Qiong Luo, Shengen Yan","doi":"10.1109/ICPADS51040.2020.00045","DOIUrl":null,"url":null,"abstract":"In computer vision deep learning (DL) tasks, most of the input image datasets are stored in the JPEG format. These JPEG datasets need to be decoded before DL tasks are performed on them. We observe two problems in the current JPEG decoding procedures for DL tasks: (1) the decoding of image entropy data in the decoder is performed sequentially, and this sequential decoding repeats with the DL iterations, which takes significant time; (2) Current parallel decoding methods under-utilize the massive hardware threads on GPUs. To reduce the image decoding time, we introduce a pre-scan mechanism to avoid the repeated image scanning in DL tasks. Our pre-scan generates boundary markers for entropy data so that the decoding can be performed in parallel. To cooperate with the existing dataset storage and caching systems, we propose two modes of the pre-scan mechanism: a compatible mode and a fast mode. The compatible mode does not change the image file structure so pre-scanned files can be stored back to disk for subsequent DL tasks. In comparison, the fast mode crafts a JPEG image into a binary format suitable for parallel decoding, which can be processed directly on the GPU. Since the GPU has thousands of hardware threads, we propose a fine-grained parallel decoding method on the pre-scanned dataset. 
The fine-grained parallelism utilizes the GPU effectively, and achieves speedups of around 1.5× over existing GPU-assisted image decoding libraries on real-world DL tasks.","PeriodicalId":196548,"journal":{"name":"2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPADS51040.2020.00045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

In computer vision deep learning (DL) tasks, most input image datasets are stored in the JPEG format, and these JPEG datasets must be decoded before DL tasks can run on them. We observe two problems in current JPEG decoding procedures for DL tasks: (1) the decoding of image entropy data is performed sequentially, and this sequential decoding repeats across DL iterations, which takes significant time; (2) current parallel decoding methods under-utilize the massive number of hardware threads on GPUs. To reduce image decoding time, we introduce a pre-scan mechanism that avoids repeated image scanning in DL tasks. Our pre-scan generates boundary markers for the entropy data so that decoding can be performed in parallel. To integrate with existing dataset storage and caching systems, we propose two modes for the pre-scan mechanism: a compatible mode and a fast mode. The compatible mode does not change the image file structure, so pre-scanned files can be stored back to disk for subsequent DL tasks. In comparison, the fast mode transforms a JPEG image into a binary format suitable for parallel decoding, which can be processed directly on the GPU. Since the GPU has thousands of hardware threads, we propose a fine-grained parallel decoding method over the pre-scanned dataset. The fine-grained parallelism utilizes the GPU effectively and achieves speedups of around 1.5× over existing GPU-assisted image decoding libraries on real-world DL tasks.
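The core obstacle the abstract describes is that JPEG entropy data has no known internal boundaries until it is decoded, forcing sequential decoding. A minimal illustration of the pre-scan idea, not the paper's actual implementation: in standard JPEG, restart markers (bytes 0xFFD0 through 0xFFD7) are the one kind of boundary that can be found by a cheap byte scan, because byte-stuffing guarantees 0xFF inside entropy data is always followed by 0x00. The sketch below scans an entropy-coded byte stream once and returns segment offsets; each segment could then be decoded independently, e.g. by a separate GPU thread. The function name is illustrative.

```python
def prescan_entropy_segments(data: bytes) -> list[tuple[int, int]]:
    """One-pass scan of JPEG entropy-coded data: split it at restart
    markers (0xFF 0xD0..0xD7) into independently decodable segments.

    Returns a list of (start, end) byte offsets, marker bytes excluded.
    Stuffed bytes (0xFF 0x00) are not markers and are skipped naturally.
    """
    segments = []
    start = 0
    i = 0
    while i < len(data) - 1:
        if data[i] == 0xFF and 0xD0 <= data[i + 1] <= 0xD7:
            segments.append((start, i))  # segment ends before the marker
            start = i + 2                # next segment begins after it
            i += 2
        else:
            i += 1
    segments.append((start, len(data)))  # trailing segment
    return segments
```

Once the segment table exists, it can be cached alongside the dataset (roughly the role of the compatible mode) or the segments can be repacked into a flat binary layout for direct GPU consumption (roughly the fast mode), so the scan cost is paid once rather than on every training epoch.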