Dual-Branch Wavelet Diffusion models with Dual-Prior Refinement for Underwater Image Enhancement
Jing Ma, Yiwei Shi, Shibai Yin, Yibin Wang, Yanfang Fu, Yee-Hong Yang
Journal of Visual Communication and Image Representation, Volume 111, Article 104535. Published 2025-07-30. DOI: 10.1016/j.jvcir.2025.104535
Abstract: Underwater images often suffer from color distortion and detail loss due to the scattering and absorption of light, presenting significant challenges in Underwater Image Enhancement (UIE). Although wavelet-based learning methods address this problem by correcting colors in low-frequency components and enhancing details in high-frequency components, they still struggle to achieve visual fidelity for human perception. As a perceptually driven approach, conditional Denoising Diffusion Models (CDDMs) combined with wavelet transforms have been widely adopted for UIE. However, these methods often focus on the generative capability of CDDM in the low-frequency components, while neglecting the effectiveness of CDDM in high-frequency processing as well as the role of accurate priors in guiding the diffusion process. To address these limitations, we propose Dual-Branch Wavelet Diffusion models with Dual-Prior Refinement (DwaveDiff) for UIE. By decomposing the image into low-frequency and high-frequency subbands using the Haar wavelet transform, the reduced-dimensional frequency information not only accelerates CDDM inference but also provides distinct subbands, allowing CDDM to effectively handle color correction and detail recovery separately. Specifically, we use the Red Channel Prior image as the condition for the low-frequency branch of the CDDM to correct color, and the Edge Captured Map as the condition for the high-frequency branch of the CDDM to recover details. In addition, the prior refinement strategy in the CDDM ensures that accurate prior information is used, guiding DwaveDiff to perform effective enhancement. Experimental results on both synthetic and real-world image datasets demonstrate that our method outperforms existing approaches both quantitatively and qualitatively.
EFTrack: Enhanced fusion for visual object tracking
Xu Guan, Chunyan Hu, Lin Xie, Shuai Yang, Feifei Lee, Qiu Chen
Journal of Visual Communication and Image Representation, Volume 111, Article 104554. Published 2025-07-30. DOI: 10.1016/j.jvcir.2025.104554
Abstract: Recently, deep learning-based networks for object tracking have mainly adopted the single-stream, single-stage framework. However, this approach often overlooks the backbone network's own limitations. To address this issue, this paper uses an independent backbone network to directly construct the tracker and proposes several optimizations. First, we propose a contour information enhancement (CIE) module that distinguishes objects from the background through frequency-domain filtering. Second, a patch information fusion (PIF) module is introduced to enable information interaction between non-overlapping patches. Furthermore, a lightweight multi-scale feature fusion module is proposed to enhance the backbone network's capability to learn multi-scale information. The network's generalization is further enhanced using the DropMAE pre-trained model. The proposed tracker demonstrates superior performance on benchmark datasets, surpassing TATrack-B and SeqTrack-B384 by 3.4% and 1.9%, respectively, in terms of the AO metric on the GOT-10k dataset. The code is released at https://github.com/Nirvanalll/EFTrack.
{"title":"Fusion feature contrastive learning and supervisory regularization for weakly supervised semantic segmentation","authors":"Weizheng Wang , Lei Zhou , Haonan Wang","doi":"10.1016/j.jvcir.2025.104538","DOIUrl":"10.1016/j.jvcir.2025.104538","url":null,"abstract":"<div><div>Weakly supervised semantic segmentation (WSSS) based on image-level labels is a challenging task. WSSS methods using image-level labels typically employ Class Activation Maps (CAM) as pseudo labels. However, many methods using Convolutional Neural Network (CNN) models are affected by their local perception capabilities, resulting in CAM that only distinguish the most salient object regions. To address this issue, building upon the Vision Transformer (ViT) model as the backbone, we design a Fusion Feature Contrastive Learning (FFCL) method that utilizes feature information relationships from ViT’s intermediate layer to guide the final layer’s feature information, improving the quality of CAM. Moreover, We also propose a Supervisory Regularization (SR) strategy that fully utilizes auxiliary CAM feature information to guide the final layer’s CAM, enhancing the completeness of the CAM activation areas. The experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets show that our proposed method achieves prominent improvements.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104538"},"PeriodicalIF":3.1,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144771817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive conventional-learning video signal compression framework using texture fulfillment","authors":"Alaa Zain , Trinh Man Hoang , Jinjia Zhou","doi":"10.1016/j.jvcir.2025.104544","DOIUrl":"10.1016/j.jvcir.2025.104544","url":null,"abstract":"<div><div>With the explosive growth of various real-time video applications, it has been recognized that video compression is crucial for efficient data storage and transmission. In the low bit-rate scenario, the conventional video coding standards are possible to have small distortion but contain hand-crafted artifacts. Meanwhile, unlike conventional approaches, learning-based end-to-end techniques emphasize perceptual quality, which usually leads to relatively large distortion. To address this problem, this work proposes a new video compression framework with texture fulfillment (named ACLTF) by collaborating with conventional and learning-based video coding technologies. We separate and compress a video sequence to a small-portion key pack and a dominated non-key pack. On the encoder side, the key pack is compressed with low distortion and high texture information but a relatively low compression ratio by conventional learning. The non-key pack is highly compacted by applying semantic segment-based layered coding. On the decoder side, semantic-based self-enhancement and multi-frame enhancement are applied to transfer and interpolate the high-texture information from the key pack to the non-key pack. All the existing video coding systems are compatible with the proposed ACLTF. Experimental results verified that by applying ACLTF to the latest video coding standards (H.266/VVC, H.265/HEVC) and learning-based video coding, it significantly enhanced the compression results by 18.08%–47.57% BD rate over the standard HEVC in all-intra and improved by 6.08%–15.78% BD rate over the standard VVC in low delay.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104544"},"PeriodicalIF":3.1,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144738816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-Adapter: Efficient video action recognition using adapter fine-tuning and cache memorization technique","authors":"Tongwei Lu, Chenrui Chang","doi":"10.1016/j.jvcir.2025.104543","DOIUrl":"10.1016/j.jvcir.2025.104543","url":null,"abstract":"<div><div>Traditional video action recognition tasks face significant computational challenges and require extensive computational resources. Recently, several studies have focused on efficient image-to-video transfer learning to address this problem. In this paper, we introduce a novel cache memory-based fine-tuning model called Cache Adapter, which efficiently fine-tunes large image pre-trained models for the video action recognition. Specifically, we freeze the entire pre-trained network and train only the parameters of the Cache Adapter block we designed to fuse spatio-temporal information. We also employ gated recurrent unit (GRU) to update cache information. By freezing most of the network parameters, we only need to train the adapters, significantly reducing the computational cost while achieving excellent performance. Furthermore, extensive experiments on two video action recognition benchmarks demonstrate that our approach can learn high-quality spatio-temporal representations of videos and achieve performance comparable to or even better than previous methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104543"},"PeriodicalIF":3.1,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144738925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PSDR-SNet: Siamese network of potential steganographic signal difference regions for image steganalysis","authors":"Hai Su , Jiamei Liu , Jun Liang","doi":"10.1016/j.jvcir.2025.104542","DOIUrl":"10.1016/j.jvcir.2025.104542","url":null,"abstract":"<div><div>Image steganalysis aims to detect whether an image has been processed with steganography to carry hidden information. The steganalysis algorithm based on Siamese network determines whether an image contains hidden information by calculating the dissimilarity between its left and right partitions, offering a new approach. However, it does not consider that steganographic signals often embed in the edges or texture-complex regions of the cover image, leaving significant room for improvement. To address this, this paper proposes a Siamese network of potential steganographic signal difference regions for image steganalysis. The proposed method fully considers the distribution characteristics of steganographic signals by segmenting the target image into texture-complex regions that are likely to contain more steganographic signals and texture-smooth regions that are likely to contain fewer signals. This provides a more effective regional division strategy for the subsequent network to analyze the differences in steganographic signals between different regions. In addition, a redesigned similarity loss function is introduced to guide the network to focus more on the subtle differences in potential steganographic signals between the segmented regions rather than differences in image content. The presence of steganographic information is determined by calculating the differences in potential steganographic signals across the divided regions. Experimental results on the BOSSbase dataset show that the proposed method achieves a maximum detection accuracy of 91.71% for the SUNIWARD steganographic algorithm at a payload of 0.4 bpp, representing an improvement of 1.58% over the baseline model SiaStegNet. The proposed method also achieves a 1.39% accuracy improvement when detecting the WOW steganographic algorithm. These results fully demonstrate the superior detection performance and robustness of the proposed method in image steganalysis.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104542"},"PeriodicalIF":3.1,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144738817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Face reconstruction with detailed skin features via three selfie images
Yakun Ju, Bandara Dissanayake, Rachel Ang, Ling Li, Dennis Sng, Alex Kot
Journal of Visual Communication and Image Representation, Volume 111, Article 104529. Published 2025-07-26. DOI: 10.1016/j.jvcir.2025.104529
Abstract: Accurate 3D reconstruction of facial skin features, such as acne, pigmentation, and wrinkles, is essential for digital facial analysis, virtual aesthetics, and dermatological diagnostics. However, achieving high-fidelity skin detail reconstruction from limited, in-the-wild inputs such as selfie images remains a largely underexplored challenge. The Hierarchical Representation Network (HRN) excels at reconstructing facial geometry from limited images but faces challenges in skin detail fidelity and multi-view matching. In this work, we present a lightweight and deployable system that reconstructs detailed 3D face models from only three guided portrait images. We address these limitations by enhancing HRN's output resolution, improving skin detail precision, and introducing a novel multi-view texture map fusion framework with illumination normalization and linear blending, enhancing texture clarity. To correct eye direction inconsistencies, we integrate a segmentation network to refine eye regions. We further develop a mobile-based prototype application that guides users through video-based face capture and enables real-time model generation. The system has been successfully applied in real-world settings. Our dataset, featuring annotated portraits of fair-skinned Asian females with visible skin conditions, serves as a benchmark for evaluation. This is the first benchmark focusing on skin-level 3D reconstruction from selfie-level inputs. We validated our method through ablation, comparison, and perception studies, all of which demonstrate clear improvements in texture fidelity and fine detail. These results indicate the method's practical value for 3D facial skin reconstruction.
{"title":"Single model learned image compression utilizing multiple scaling factors","authors":"Ran Wang , Wen Jiang , Heming Sun , Jiro Katto","doi":"10.1016/j.jvcir.2025.104541","DOIUrl":"10.1016/j.jvcir.2025.104541","url":null,"abstract":"<div><div>Image compression is a critical task in multimedia. However, all learned-based single rate compression methods face challenges, such as prolonged training time due to the need for a dedicated model per bitrate and increased memory usage. Some variable rate methods require extra input, conditional networks, or still involve training multiple models. In this paper, we propose a unified approach using scaling factors to enable variable rate compression within a single model. The scaling factors consist of multi-gain units and quantization step size. The multi-gain units reduce redundancy in encoder and decoder representations, while the quantization step size controls quantization error. We also observe unevenness among slices in the Channel-Wise entropy model, and propose channel-wise quantization compensation by assigning specific step sizes to each slice. Our method supports continuous rate adaptation without retraining. Extensive experiments on CNN-based, Transformer-based, and CNN-Transformer mixed models demonstrate superior performance across a wide range of bitrates.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104541"},"PeriodicalIF":3.1,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144771815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantization-friendly super-resolution: Unveiling the benefits of activation normalization","authors":"Dongjea Kang, Myungjun Son, Hongjae Lee, Seung-Won Jung","doi":"10.1016/j.jvcir.2025.104539","DOIUrl":"10.1016/j.jvcir.2025.104539","url":null,"abstract":"<div><div>Super-resolution (SR) has achieved remarkable progress with deep neural networks, but the substantial memory and computational demands of SR networks limit their use in resource-constrained environments. To address these challenges, various quantization methods have been developed, focusing on managing the diverse and asymmetric activation distributions in SR networks. This focus is crucial, as most SR networks exclude batch normalization (BN) due to concerns about image quality degradation from limited activation range flexibility. However, this decision is made in the context of full-precision SR networks, leaving BN’s impact on quantized SR networks uncertain. This paper revisits BN’s role in quantized SR networks, presenting a detailed performance analysis of multiple quantized SR models with and without BN. Experimental results show that including BN in quantized SR networks enhances performance and simplifies network design through minor yet significant structural adjustments. These findings challenge conventional assumptions and offer new insights for SR network optimization.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104539"},"PeriodicalIF":2.6,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading","authors":"Byung Hoon Lee , Wooseok Shin , Sung Won Han","doi":"10.1016/j.jvcir.2025.104540","DOIUrl":"10.1016/j.jvcir.2025.104540","url":null,"abstract":"<div><div>The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository (<span><span>https://github.com/Leebh-kor/TD3Net</span><svg><path></path></svg></span>).</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104540"},"PeriodicalIF":2.6,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144713544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}