OODNet: A deep blind JPEG image compression deblocking network using out-of-distribution detection
Syed Safwan Ahsan, Alireza Esmaeilzehi, M. Omair Ahmad
Journal of Visual Communication and Image Representation, vol. 104, Article 104302, published 2024-10-01. doi: 10.1016/j.jvcir.2024.104302

Abstract: JPEG is one of the most popular image compression techniques, with applications ranging from medical imaging to surveillance systems. Since JPEG introduces blocking artifacts into the decompressed visual signals, enhancing the quality of these images is of paramount importance. Various deep neural networks have recently been proposed for JPEG image deblocking and can effectively reduce the blocking artifacts produced by JPEG compression. However, most of these schemes can only handle decompressed images generated with the specific JPEG quality factor (QF) values employed during network training; when images are compressed with other QF values, their performance drops significantly. To address this, we propose a novel deep learning-based blind JPEG image deblocking method that employs out-of-distribution detection to perform deblocking efficiently for various QF values. The proposed scheme distinguishes between images decompressed with QF values seen in the training set and those with unseen QF values, and then applies a suitable deblocking strategy to generate high-quality images. The proposed scheme is shown to outperform state-of-the-art JPEG image deblocking methods for various QF values.

{"title":"DCPNet: Deformable Control Point Network for image enhancement","authors":"Se-Ho Lee , Seung-Wook Kim","doi":"10.1016/j.jvcir.2024.104308","DOIUrl":"10.1016/j.jvcir.2024.104308","url":null,"abstract":"<div><div>In this paper, we present a novel image enhancement network consisting of global and local color enhancement. The proposed network model is constructed using global transformation functions, which are formed by a set of piece-wise quadratic curves and a local color enhancement network based on the encoder–decoder network. To adaptively and dynamically control the ranges of each piece-wise curve, we introduce deformable control points (DCPs), which determine the overall structure of the global transformation functions. The parameters for piece-wise quadratic curve fitting and DCPs are estimated using the proposed DCP network (DCPNet). DCPNet processes a down-sampled image to derive the DCP parameters: The DCP offsets and the curve parameters. Then, we obtain a set of DCPs from the DCP offsets and connect each adjacent DCP pair by using the curve parameter to construct a global transformation function for each color channel. The original input images are then transformed based on the resulting transformation functions to obtain globally enhanced images. Finally, the intermediate image is fed into the local enhancement network, which has a U-Net architecture, to produce the spatially enhanced images. Extensive experimental results demonstrate the superiority of the proposed method over state-of-the-art methods in various image enhancement tasks, such as image retouching, tone-mapping, and underwater image enhancement.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104308"},"PeriodicalIF":2.6,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142534735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diving deep into human action recognition in aerial videos: A survey","authors":"Surbhi Kapoor, Akashdeep Sharma, Amandeep Verma","doi":"10.1016/j.jvcir.2024.104298","DOIUrl":"10.1016/j.jvcir.2024.104298","url":null,"abstract":"<div><div>Human Action Recognition from Unmanned Aerial Vehicles is a dynamic research domain with significant benefits in scale, mobility, deployment, and covert observation. This paper offers a comprehensive review of state-of-the-art algorithms for human action recognition and provides a novel taxonomy that categorizes the reviewed methods into two broad categories: Localization based and Globalization based. These categories are defined by how actions are segmented from visual data and how their spatial and temporal structures are modeled. We examine these techniques, highlighting their strengths and limitations, and provide essential background on human action recognition, including fundamental concepts and challenges in aerial videos. Additionally, we discuss existing datasets, enabling a comparative analysis. This survey identifies gaps and suggests future research directions, serving as a catalyst for advancing human action recognition in aerial videos. To our knowledge, this is the first detailed review of this kind.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104298"},"PeriodicalIF":2.6,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142318711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-CSC: Low-light image enhancement with zero-reference color self-calibration
Haiyan Jin, Yujia Chen, Fengyuan Zuo, Haonan Su, YuanLin Zhang
Journal of Visual Communication and Image Representation, vol. 104, Article 104293, published 2024-09-20. doi: 10.1016/j.jvcir.2024.104293

Abstract: Zero-Reference Low-Light Image Enhancement (LLIE) techniques mainly focus on grey-scale inhomogeneities, and few methods consider how to explicitly recover a dark scene to achieve enhancement in both color and overall illumination. In this paper, we introduce a novel Zero-Reference Color Self-Calibration framework for enhancing low-light images, termed Zero-CSC. It effectively emphasizes channel-wise representations that contain fine-grained color information, achieving a natural result in a progressive manner. Furthermore, we propose a Light Up (LU) module with large-kernel convolutional blocks to improve overall illumination; it is implemented with a simple U-Net and further simplified into a light-weight structure. Experiments on representative datasets show that our model consistently achieves state-of-the-art performance in image signal-to-noise ratio, structural similarity, and color accuracy, setting new records on the challenging SICE dataset with improvements of 23.7% in image signal-to-noise ratio and 5.3% in structural similarity over the most advanced methods.

{"title":"M-YOLOv8s: An improved small target detection algorithm for UAV aerial photography","authors":"Siyao Duan , Ting Wang , Tao Li , Wankou Yang","doi":"10.1016/j.jvcir.2024.104289","DOIUrl":"10.1016/j.jvcir.2024.104289","url":null,"abstract":"<div><div>The object of UAV target detection usually means small target with complicated backgrounds. In this paper, an object detection model M-YOLOv8s based on UAV aerial photography scene is proposed. Firstly, to solve the problem that the YOLOv8s model cannot adapt to small target detection, a small target detection head (STDH) module is introduced to fuse the location and appearance feature information of the shallow layers of the backbone network. Secondly, Inner-Wise intersection over union (Inner-WIoU) is designed as the boundary box regression loss, and auxiliary boundary calculation is used to accelerate the regression speed of the model. Thirdly, the structure of multi-scale feature pyramid network (MS-FPN) can effectively combine the shallow network information with the deep network information and improve the performance of the detection model. Furthermore, a multi-scale cross-spatial attention (MCSA) module is proposed to expand the feature space through multi-scale branch, and then achieves the aggregation of target features through cross-spatial interaction, which improves the ability of the model to extract target features. Finally, the experimental results show that our model does not only possess fewer parameters, but also the values of mAP<sub>0.5</sub> are 6.6% and 5.4% higher than the baseline model on the Visdrone2019 validation dataset and test dataset, respectively. Then, as a conclusion, the M-YOLOv8s model achieves better detection performance than some existing ones, indicating that our proposed method can be more suitable for detecting the small targets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104289"},"PeriodicalIF":2.6,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142315099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-complexity content-aware encoding optimization of batch video
Jiahao Wu, Dexin Deng, Yilin Li, Lu Yu, Kai Li, Ying Chen
Journal of Visual Communication and Image Representation, vol. 104, Article 104295, published 2024-09-17. doi: 10.1016/j.jvcir.2024.104295

Abstract: With the proliferation of short-form video traffic, video service providers face the challenge of balancing video quality and bandwidth consumption while processing massive volumes of videos. The most straightforward approach is to apply uniform encoding parameters to all videos. However, such an approach ignores differences in video content, and alternative encoding parameter configurations may improve global coding efficiency. Finding the optimal combination of encoding parameters for a batch of videos requires a large amount of redundant encoding, which introduces significant computational cost. To address this issue, we propose a low-complexity encoding parameter prediction model that adaptively adjusts the encoding parameters according to video content. Experiments show that, when only the CRF encoding parameter is changed, our prediction model achieves 27.04%, 6.11%, and 15.92% bit savings in terms of PSNR, SSIM, and VMAF respectively, while maintaining acceptable complexity compared to using the same CRF value for all videos.

{"title":"Leveraging occupancy map to accelerate video-based point cloud compression","authors":"Wenyu Wang, Gongchun Ding, Dandan Ding","doi":"10.1016/j.jvcir.2024.104292","DOIUrl":"10.1016/j.jvcir.2024.104292","url":null,"abstract":"<div><p>Video-based Point Cloud Compression enables point cloud streaming over the internet by converting dynamic 3D point clouds to 2D geometry and attribute videos, which are then compressed using 2D video codecs like H.266/VVC. However, the complex encoding process of H.266/VVC, such as the quadtree with nested multi-type tree (QTMT) partition, greatly hinders the practical application of V-PCC. To address this issue, we propose a fast CU partition method dedicated to V-PCC to accelerate the coding process. Specifically, we classify coding units (CUs) of projected images into three categories based on the occupancy map of a point cloud: unoccupied, partially occupied, and fully occupied. Subsequently, we employ either statistic-based rules or machine-learning models to manage the partition of each category. For unoccupied CUs, we terminate the partition directly; for partially occupied CUs with explicit directions, we selectively skip certain partition candidates; for the remaining CUs (partially occupied CUs with complex directions and fully occupied CUs), we train an edge-driven LightGBM model to predict the partition probability of each partition candidate automatically. Only partitions with high probabilities are retained for further Rate–Distortion (R–D) decisions. Comprehensive experiments demonstrate the superior performance of our proposed method: under the V-PCC common test conditions, our method reduces encoding time by 52% and 44% in geometry and attribute, respectively, while incurring only 0.68% (0.66%) BD-Rate loss in D1 (D2) measurements and 0.79% (luma) BD-Rate loss in attribute, significantly surpassing state-of-the-art works.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104292"},"PeriodicalIF":2.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142239471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SR4KVQA: Video quality assessment database and metric for 4K super-resolution","authors":"Ruidi Zheng , Xiuhua Jiang","doi":"10.1016/j.jvcir.2024.104290","DOIUrl":"10.1016/j.jvcir.2024.104290","url":null,"abstract":"<div><p>The quality assessment for 4K super-resolution (SR) videos can be conducive to the optimization of video SR algorithms. To improve the subjective and objective consistency of the SR quality assessment, a 4K video database and a blind metric are proposed in this paper. In the database SR4KVQA, there are 30 4K pristine videos, from which 600 SR 4K distorted videos with mean opinion score (MOS) labels are generated by three classic interpolation methods, six SR algorithms based on the deep neural network (DNN), and two SR algorithms based on the generative adversarial network (GAN). The benchmark experiment of the proposed database indicates that video quality assessment (VQA) of the 4K SR videos is challenging for the existing metrics. Among those metrics, the Video-Swin-Transformer backbone demonstrates tremendous potential in the VQA task. Accordingly, a blind VQA metric based on the Video-Swin-Transformer backbone is established, where the normalized loss function and optimized spatio-temporal sampling strategy are applied. The experiment result manifests that the Pearson linear correlation coefficient (PLCC) and Spearman rank-order correlation coefficient (SROCC) of the proposed metric reach 0.8011 and 0.8275 respectively on the SR4KVQA database, which outperforms or competes with the state-of-the-art VQA metrics. The database and the code proposed in this paper are available in the GitHub repository, <span><span>https://github.com/AlexReadyNico/SR4KVQA</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104290"},"PeriodicalIF":2.6,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142239470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data compensation and feature fusion for sketch based person retrieval","authors":"Yu Ye , Jun Chen , Zhihong Sun , Mithun Mukherjee","doi":"10.1016/j.jvcir.2024.104287","DOIUrl":"10.1016/j.jvcir.2024.104287","url":null,"abstract":"<div><div>Sketch re-identification (Re-ID) aims to retrieve pedestrian photo in the gallery dataset by a query sketch drawn by professionals. The sketch Re-ID task has not been adequately studied because collecting such sketches is difficult and expensive. In addition, the significant modality difference between sketches and images makes extracting the discriminative feature information difficult. To address above issues, we introduce a novel sketch-style pedestrian dataset named Pseudo-Sketch dataset. Our proposed dataset maximizes the utilization of the existing person dataset resources and is freely available, thus effectively reducing the expenses associated with the training and deployment phases. Furthermore, to mitigate the modality gap between sketches and visible images, a cross-modal feature fusion network is proposed that incorporates information from each modality. Experiment results show that the proposed Pseudo-Sketch dataset can effectively complement the real sketch dataset, and the proposed network obtains competitive results than SOTA methods. The dataset will be released later.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104287"},"PeriodicalIF":2.6,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142312111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Iterative decoupling deconvolution network for image restoration
Yixing Ji, Shengjiang Kong, Weiwei Wang, Xixi Jia, Xiangchu Feng
Journal of Visual Communication and Image Representation, vol. 104, Article 104288, published 2024-09-12. doi: 10.1016/j.jvcir.2024.104288

Abstract: The iterative decoupled deblurring BM3D (IDDBM3D) (Danielyan et al., 2011) combines the analysis representation and the synthesis representation and decouples the deblurring and denoising operations, so that both sub-problems can be solved easily. However, IDDBM3D has some limitations. First, the analysis and synthesis transformations are analytical and thus have limited representation ability. Second, it is difficult to effectively remove image noise with the thresholding operation. Third, there are hyper-parameters that must be tuned manually, which is difficult and time consuming. In this work, we propose an iterative decoupling deconvolution network (IDDNet) by unrolling the iterative decoupling algorithm of IDDBM3D. In the proposed IDDNet, the analysis and synthesis transformations are implemented by encoder and decoder modules, the denoising is performed by a convolutional-neural-network-based denoiser, and the hyper-parameters are estimated by a hyper-parameter module. We apply our model to image deblurring and super-resolution. Experimental results show that IDDNet significantly outperforms state-of-the-art unfolding networks.
