Hossein Dehnavi, Mohammad Dehnavi, Sajad Haghzad Klidbary
{"title":"Fcd-cnn:使用 CNN 为 HEVC 内编码器做出基于 FPGA 的 CU 深度决策","authors":"Hossein Dehnavi, Mohammad Dehnavi, Sajad Haghzad Klidbary","doi":"10.1007/s11554-024-01487-9","DOIUrl":null,"url":null,"abstract":"<p>Video compression for storage and transmission has always been a focal point for researchers in the field of image processing. Their efforts aim to reduce the data volume required for video representation while maintaining its quality. HEVC is one of the efficient standards for video compression, receiving special attention due to the increasing demand for high-resolution videos. The main step in video compression involves dividing the coding unit (CU) blocks into smaller blocks that have a uniform texture. In traditional methods, The Discrete Cosine Transform (DCT) is applied, followed by the use of RDO for decision-making on partitioning. This paper presents a novel convolutional neural network (CNN) and its hardware implementation as an alternative to DCT, aimed at speeding up partitioning and reducing the hardware resources required. The proposed hardware utilizes an efficient and lightweight CNN to partition CUs with low hardware resources in real-time applications. This CNN is trained for different Quantization Parameters (QPs) and block sizes to prevent overfitting. Furthermore, the system’s input size is fixed at <span>\\(16\\times 16\\)</span>, and other input sizes are scaled to this dimension. Loop unrolling, data reuse, and resource sharing are applied in hardware implementation to save resources. The hardware architecture is fixed for all block sizes and QPs, and only the coefficients of the CNN are changed. In terms of compression quality, the proposed hardware achieves a <span>\\(4.42\\%\\)</span> BD-BR and <span>\\(-\\,0.19\\)</span> BD-PSNR compared to HM16.5. The proposed system can process <span>\\(64\\times 64\\)</span> CU at 150 MHz and in 4914 clock cycles. The hardware resources utilized by the proposed system include 13,141 LUTs, 15,885 Flip-flops, 51 BRAMs, and 74 DSPs.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"70 1","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2024-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fcd-cnn: FPGA-based CU depth decision for HEVC intra encoder using CNN\",\"authors\":\"Hossein Dehnavi, Mohammad Dehnavi, Sajad Haghzad Klidbary\",\"doi\":\"10.1007/s11554-024-01487-9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Video compression for storage and transmission has always been a focal point for researchers in the field of image processing. Their efforts aim to reduce the data volume required for video representation while maintaining its quality. HEVC is one of the efficient standards for video compression, receiving special attention due to the increasing demand for high-resolution videos. The main step in video compression involves dividing the coding unit (CU) blocks into smaller blocks that have a uniform texture. In traditional methods, The Discrete Cosine Transform (DCT) is applied, followed by the use of RDO for decision-making on partitioning. This paper presents a novel convolutional neural network (CNN) and its hardware implementation as an alternative to DCT, aimed at speeding up partitioning and reducing the hardware resources required. 
The proposed hardware utilizes an efficient and lightweight CNN to partition CUs with low hardware resources in real-time applications. This CNN is trained for different Quantization Parameters (QPs) and block sizes to prevent overfitting. Furthermore, the system’s input size is fixed at <span>\\\\(16\\\\times 16\\\\)</span>, and other input sizes are scaled to this dimension. Loop unrolling, data reuse, and resource sharing are applied in hardware implementation to save resources. The hardware architecture is fixed for all block sizes and QPs, and only the coefficients of the CNN are changed. In terms of compression quality, the proposed hardware achieves a <span>\\\\(4.42\\\\%\\\\)</span> BD-BR and <span>\\\\(-\\\\,0.19\\\\)</span> BD-PSNR compared to HM16.5. The proposed system can process <span>\\\\(64\\\\times 64\\\\)</span> CU at 150 MHz and in 4914 clock cycles. The hardware resources utilized by the proposed system include 13,141 LUTs, 15,885 Flip-flops, 51 BRAMs, and 74 DSPs.</p>\",\"PeriodicalId\":51224,\"journal\":{\"name\":\"Journal of Real-Time Image Processing\",\"volume\":\"70 1\",\"pages\":\"\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2024-06-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Real-Time Image Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11554-024-01487-9\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Real-Time Image Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11554-024-01487-9","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Fcd-cnn: FPGA-based CU depth decision for HEVC intra encoder using CNN
Video compression for storage and transmission has long been a focal point for researchers in image processing, whose efforts aim to reduce the data volume required to represent video while maintaining its quality. High Efficiency Video Coding (HEVC) is one of the most efficient video compression standards and has received special attention due to the growing demand for high-resolution video. A main step in HEVC encoding is partitioning coding unit (CU) blocks into smaller blocks of uniform texture. Traditional methods apply the discrete cosine transform (DCT) followed by rate-distortion optimization (RDO) to decide on the partitioning. This paper presents a novel convolutional neural network (CNN) and its hardware implementation as an alternative to DCT, aimed at speeding up partitioning and reducing the hardware resources required. The proposed hardware uses an efficient, lightweight CNN to partition CUs with low resource usage in real-time applications. The CNN is trained for different quantization parameters (QPs) and block sizes to prevent overfitting. The system's input size is fixed at \(16\times 16\), and other input sizes are scaled to this dimension. Loop unrolling, data reuse, and resource sharing are applied in the hardware implementation to save resources. The hardware architecture is fixed for all block sizes and QPs; only the CNN coefficients are changed. In terms of compression quality, the proposed hardware incurs a \(4.42\%\) BD-BR increase and a \(-0.19\) dB BD-PSNR loss relative to HM16.5. The proposed system processes a \(64\times 64\) CU in 4914 clock cycles at 150 MHz. The hardware resources used are 13,141 LUTs, 15,885 flip-flops, 51 BRAMs, and 74 DSPs.
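The paper describes a fixed-point FPGA datapath rather than software, but the decision flow summarized in the abstract can be sketched in a few lines. The PyTorch snippet below is an illustrative sketch only: the layer counts, channel widths, QP values, and variable names are assumptions, not the authors' network. The parts taken from the abstract are the fixed \(16\times 16\) input (larger CUs are rescaled to it), a split/no-split decision per CU, and a single architecture reused for all QPs and block sizes with only the coefficients swapped.

```python
# Illustrative sketch only -- the paper does not publish its network definition,
# so the layer configuration and coefficient-set names below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CUDepthCNN(nn.Module):
    """Binary split / no-split decision for one CU, from a 16x16 luma patch."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                          # 16x16 -> 8x8
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                          # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 1),                         # logit: split (1) vs. no split (0)
        )

    def forward(self, cu_luma: torch.Tensor) -> torch.Tensor:
        # cu_luma: (N, 1, H, W) with H = W in {8, 16, 32, 64};
        # every block size is rescaled to the fixed 16x16 network input.
        x = F.interpolate(cu_luma, size=(16, 16), mode="bilinear", align_corners=False)
        return self.classifier(self.features(x))


# One coefficient set per (QP, block size), with an unchanged architecture --
# mirroring the fixed hardware datapath with exchangeable CNN weights.
# In practice these would be trained weights; random initializations stand in here.
coeff_sets = {
    (qp, size): CUDepthCNN().state_dict()
    for qp in (22, 27, 32, 37) for size in (16, 32, 64)   # assumed QP/size grid
}

model = CUDepthCNN().eval()
model.load_state_dict(coeff_sets[(32, 64)])               # coefficients for QP=32, 64x64 CUs
with torch.no_grad():
    dummy_cu = torch.rand(1, 1, 64, 64)                   # stand-in 64x64 luma block
    split = torch.sigmoid(model(dummy_cu)) > 0.5
```

For reference, the reported throughput of 4914 clock cycles per \(64\times 64\) CU at 150 MHz corresponds to about 32.8 µs per CU (4914 cycles ÷ 150 MHz).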
About the journal:
Due to rapid advancements in integrated circuit technology, the rich theoretical results that have been developed by the image and video processing research community are now being increasingly applied in practical systems to solve real-world image and video processing problems. Such systems involve constraints placed not only on their size, cost, and power consumption, but also on the timeliness of the image data processed.
Examples of such systems are mobile phones, digital still/video/cell-phone cameras, portable media players, personal digital assistants, high-definition television, video surveillance systems, industrial visual inspection systems, medical imaging devices, vision-guided autonomous robots, spectral imaging systems, and many other real-time embedded systems. In these real-time systems, strict timing requirements demand that results are available within a certain interval of time as imposed by the application.
It is often the case that an image processing algorithm is developed and proven theoretically sound, presumably with a specific application in mind. However, its practical applications and the detailed steps, methodology, and trade-off analysis required to achieve real-time performance are not fully explored, leaving these critical and usually non-trivial issues to those wishing to employ the algorithm in a real-time system.
The Journal of Real-Time Image Processing is intended to bridge the gap between the theory and practice of image processing, serving the greater community of researchers, practicing engineers, and industrial professionals who deal with designing, implementing or utilizing image processing systems which must satisfy real-time design constraints.