MG-SSAF: An advanced vision Transformer
Shuai Yang, Chunyan Hu, Lin Xie, Feifei Lee, Qiu Chen
Journal of Visual Communication and Image Representation, vol. 112, Article 104578, published 2025-09-04. DOI: 10.1016/j.jvcir.2025.104578
Abstract: Despite the excellent performance of local-window-based multi-head self-attention (MSA), non-overlapping windows hinder cross-window feature interaction, and the computational complexity remains high. This paper presents MG-SSAF, a novel Vision Transformer backbone that addresses these problems. First, we propose a Space-wise Separable Multi-head Self-attention (SS-MSA) mechanism to further reduce the computational complexity. Then, an extra Attention Fusion Module (AF Module) is introduced for the attention weights in SS-MSA to enhance the representation ability of the similarity. Next, we present a Multi-scale Global Multi-head Self-attention (MG-MSA) method to perform global feature interaction. Moreover, we propose to perform window-based MSA and global MSA simultaneously in one attention module to realize local feature modeling and global feature interaction. The experimental results demonstrate that MG-SSAF achieves superior performance with fewer parameters and lower computational complexity. The code is available at https://github.com/shuaiyang11/MG-SSAF.

Edge-Preserving Image Smoothing Based on Local Structure Reconstruction
Jianwu Long, Kaixin Zhang, Shuang Chen, Yuanqin Liu, Qi Luo
Journal of Visual Communication and Image Representation, vol. 112, Article 104577, published 2025-09-02. DOI: 10.1016/j.jvcir.2025.104577
Abstract: Edge-preserving filters are a fundamental component of computational photography and computer vision. Traditional filtering methods are generally classified into local and global approaches; however, the lack of full integration between the two often leads to the degradation of weak structural information. To address this issue, we propose an edge-preserving image smoothing method based on local structure reconstruction. The proposed algorithm integrates a global optimization strategy while fully leveraging the intrinsic correlation between neighboring pixels, thereby significantly enhancing both smoothing quality and edge preservation. Our method is unified under an L_p (0 < p ≤ 2) model framework, enabling diverse smoothing effects by adjusting the parameter p. Compared to existing edge-preserving filters, the proposed approach demonstrates superior performance in both visual quality and quantitative evaluation metrics.

{"title":"Convolutional Dual-Attention-Network (CDAN): A multiple light intensities based driver emotion recognition method","authors":"Ahad Ahamed , Xiaohui Yang , Tao Xu , Qingbei Guo","doi":"10.1016/j.jvcir.2025.104558","DOIUrl":"10.1016/j.jvcir.2025.104558","url":null,"abstract":"<div><div>Driver emotion recognition is critical for enhancing traffic safety and influencing driver behavior. However, current methods struggle to accurately classify emotions under variable lighting conditions such as bright sunlight, shadows, and low light environments, resulting in inconsistent feature extraction and reduced accuracy. Moreover, many approaches incur high computational costs and excessive feature exchanges, limiting real-world deployment in resource-constrained settings. To address these challenges, we propose the Convolutional Dual-Attention Network (CDAN), a novel framework designed to mitigate the impact of light intensity variations in driving scenarios. Our framework integrates Multi-Convolutional Linear Layer Attention (MCLLA), which leverages linear attention augmented with Rotary Positional Encoding (RoPE) and Locally Enhanced Positional Encoding (LePE) to capture global and local spatial relationships. Additionally, a Convolutional Attention Module (CAM) refines feature maps to improve representation quality. Evaluations of MLI-DER, modified KMU-FED, and CK+ datasets demonstrate its enhanced effectiveness compared to existing methods in handling diverse lighting conditions.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104558"},"PeriodicalIF":3.1,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144989144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NGEMD & ADV-NGEMD: a unified framework for high-capacity, efficient, and adversarially secure image steganography
Hanieh Rafiei, Mojtaba Mahdavi, Ahmad Reza NaghshNilchi
Journal of Visual Communication and Image Representation, vol. 112, Article 104575, published 2025-08-31. DOI: 10.1016/j.jvcir.2025.104575
Abstract: Among image steganography techniques, Exploiting Modification Direction (EMD) is favored for the high efficiency it achieves through minimal image alterations. However, this efficiency comes at the cost of low capacity, prompting the development of numerous EMD-based methods focused primarily on increasing payload; none has managed to deliver both high capacity and optimal efficiency at the same time. To address these shortcomings, we introduce NGEMD, a next-generation EMD-based steganographic framework that determines optimal extraction coefficients and bases solely from the pixel group length, maximum per-pixel change, and number of modifiable pixels (n, z, k). By deriving recursive relations to establish an ary notational system and employing a systematic solution based on the Chinese Remainder Theorem, NGEMD maximizes both capacity and efficiency while significantly reducing computational cost. Because conventional EMD-based methods are inherently weak against modern steganalysis, we further develop ADV-NGEMD (Adversarially-NGEMD), which resists deep learning-based steganalyzers such as YeNet by treating the hidden message as an adversarial vector and applying changes based on the opposite sign of the gradient, while controlling the modifications through a customized cost function. Comprehensive experiments confirm that both NGEMD and ADV-NGEMD deliver exceptional performance, achieving high payload capacities (up to 2.5 bpp) while preserving visual quality (PSNR up to 58 dB and SSIM above 0.99) and significantly increasing miss-detection rates, from 4% in NGEMD to as high as 60% in ADV-NGEMD at comparable capacities, without sacrificing their high-capacity advantages.

{"title":"Survey: 3D watermarking techniques","authors":"Ertugrul Gul , Gary K.L. Tam","doi":"10.1016/j.jvcir.2025.104572","DOIUrl":"10.1016/j.jvcir.2025.104572","url":null,"abstract":"<div><div>In today’s world, 3D multimedia data is widely utilized in diverse fields such as military, medical, and remote sensing. The advancement of multimedia technologies, however, exposes 3D multimedia content to an increasing risk of malicious interventions. It has become highly essential to implement security measures ensuring the authenticity and copyright protection of 3D multimedia content. Watermarking is considered one of the most reliable and practical approaches for this purpose. This work provides an in-depth and up-to-date overview of various 3D watermarking methods, covering different data forms, including 3D images, 3D videos, 3D meshes, point clouds, and NeRF. We have categorized these methods from multiple perspectives, comparing their respective advantages and disadvantages. The study also identifies attacks based on data type and discusses metrics for evaluating the methods according to their intended use and data type. We further present observations, research issues, challenges, and future directions for 3D watermarking. This includes strength factor optimization, copyright concerns related to 3D printed objects, detection and recovery of tampered areas, and watermarking in 4D (3D Dynamic) and NeRF domains.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104572"},"PeriodicalIF":3.1,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144911533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taylor expansion-based Kolmogorov–Arnold network for blind image quality assessment","authors":"Ze Chen , Shaode Yu","doi":"10.1016/j.jvcir.2025.104571","DOIUrl":"10.1016/j.jvcir.2025.104571","url":null,"abstract":"<div><div>Kolmogorov–Arnold Network (KAN) has attracted growing interest for its strong function approximation capability. In our previous work, KAN and its variants were explored in score regression for blind image quality assessment (BIQA). However, these models encounter challenges when processing high-dimensional features, leading to limited performance gains and increased computational cost. To address these issues, we propose TaylorKAN that leverages the Taylor expansions as learnable activation functions to enhance local approximation capability. To improve the computational efficiency, network depth reduction and feature dimensionality compression are integrated into the TaylorKAN-based score regression pipeline. On five databases (BID, CLIVE, KonIQ, SPAQ, and FLIVE) with authentic distortions, extensive experiments demonstrate that TaylorKAN consistently outperforms the other KAN-related models, indicating that the local approximation via Taylor expansions is more effective than global approximation using orthogonal functions. Its generalization capacity is validated through inter-database experiments. The findings highlight the potential of TaylorKAN as an efficient and robust model for high-dimensional score regression.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104571"},"PeriodicalIF":3.1,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144907001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MEN-VVDF: Multipath excitation network-based video violence detection framework focusing on human activity in keyframes
Chenghao Li, Gang Liang, Jiaping Lin, Liangyin Chen, Wenbo He, Jin Yang
Journal of Visual Communication and Image Representation, vol. 112, Article 104573, published 2025-08-25. DOI: 10.1016/j.jvcir.2025.104573
Abstract: To date, video violence detection remains a challenge in visual communication because violent events are sudden and unpredictable, making it difficult to efficiently define and locate the occurrence of violence in video data. In addition, the complexity and redundancy of video limit existing methods' ability to extract relevant information and reduce detection accuracy. Effectively recognizing violence from video clips is therefore still an open problem. This paper proposes a video-level framework for constructing human action sequences and detecting violence. First, a keyframe extraction algorithm is developed to capture representative and informative frames. Then, a strategy is introduced to emphasize human actions and eliminate background bias. Lastly, a novel neural network is designed to excite spatio-temporal, channel, and motion features to model violence effectively. The proposed framework is comprehensively evaluated on two large-scale benchmark datasets. The experimental results demonstrate that it outperforms existing state-of-the-art schemes, achieving classification accuracies of more than 98% and 94% on the two datasets.

{"title":"Enhancing 3D multi-organ segmentation via uncertainty guidance and boundary knowledge distillation","authors":"Xiangchun Yu, Longjun Ding, Tianqi Wu, Dingwen Zhang","doi":"10.1016/j.jvcir.2025.104574","DOIUrl":"10.1016/j.jvcir.2025.104574","url":null,"abstract":"<div><div>We propose the Uncertainty Guidance and Boundary Knowledge Distillation (UGBKD) framework for enhancing 3D multi-organ segmentation performance of student networks. UGBKD integrates three strategies: uncertainty-guided knowledge distillation, learning difficulty mining mechanism, and boundary knowledge distillation. The teacher-student distillation is adeptly guided by leveraging estimated uncertainty and the learning difficulty mining mechanism. Boundary knowledge distillation further alleviates blurred boundary challenges. Initially, a pre-trained denoising autoencoder DAE with anatomical perception priors is employed to estimate prediction uncertainty, and the uncertainty guided strategy promotes consistent knowledge transfer from the teacher. Subsequently, the learning difficulty mining mechanism focuses on difficult areas for the student. Lastly, boundary knowledge distillation extracts and transfers crucial boundary information to enhance the student’s boundary perception. Extensive experiments on WORD and BTCV datasets validate our proposed method’s effectiveness in improving segmentation accuracy and robustness. Code is available at <span><span>https://github.com/wutianqi-Learning/UGBKD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104574"},"PeriodicalIF":3.1,"publicationDate":"2025-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144903226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel and efficient image dehazing technique for Advanced Driver Assistance Systems","authors":"Harish Babu Gade , Venkata Krishna Odugu , Anirudh Reddy R , Sireesha Pendem","doi":"10.1016/j.jvcir.2025.104570","DOIUrl":"10.1016/j.jvcir.2025.104570","url":null,"abstract":"<div><div>Advanced Driver Assistance Systems (ADAS) rely on clear visual input to ensure safe and accurate driving decisions. However, atmospheric conditions like haze and fog can significantly reduce image visibility and clarity. This paper presents an efficient and lightweight image dehazing method specifically designed for ADAS applications. The proposed approach is based on two core modules: Depth Refinement Transmission Rate Estimation (DRTRE) and Distributed AirLight Estimation (DALE). Unlike deep learning-based techniques, our method does not require any training data or neural networks, making it well-suited for real-time hardware implementation. DRTRE estimates scene depth using the saturation and value components of the image and refines the transmission rate through adaptive thresholding and calibration. DALE improves the estimation of AirLight by analyzing spatially distributed depth values to handle non-uniform haze. Together, these modules restore clear images while minimizing computational overhead. Experimental results show that the proposed dehazing method achieves an average PSNR improvement of up to 18.19% and MSE reduction of approximately 33.80% compared to existing methods. It also demonstrates a consistent improvement in SSIM, with gains of up to 11.63%, indicating enhanced structural fidelity. Furthermore, the method improves the Comprehensive Performance Metric (CPM) by up to 4.07 times and reduces the Naturalness Image Quality Evaluator (NIQE) by as much as 17.17%, confirming superior perceptual and quantitative performance. The complete system is implemented in Verilog Hardware Description Language (HDL) and synthesized on a Xilinx Zynq-7000 series Field Programmable Gate Array (FPGA). The proposed architecture demonstrates substantial hardware efficiency, achieving reductions of up to 98.3% in logic elements, 54.4% in memory registers, and 61.9% in line buffer usage compared to existing designs.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104570"},"PeriodicalIF":3.1,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144890427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal semantic embedding network for 3D shape recognition and retrieval
Shichao Jiao, Liye Long, Liqun Kuang, Fengguang Xiong, Xie Han
Journal of Visual Communication and Image Representation, vol. 112, Article 104559, published 2025-08-19. DOI: 10.1016/j.jvcir.2025.104559
Abstract: Current methods for 3D shape recognition and retrieval use deep learning techniques and achieve commendable performance with a single representation, but they neglect the multi-modal information inherent in the same 3D object. Furthermore, certain approaches treat recognition and retrieval as distinct tasks, although the two should be synergistic rather than antagonistic. In this paper, we propose a multi-modal semantic embedding network designed to deliver a more comprehensive representation of 3D shapes, thereby enhancing recognition accuracy and retrieval efficacy. We first employ two independent feature extractors to derive multi-view and point-cloud features. We then introduce a multi-modal feature-fusion method that emphasizes uncovering correlations between the modal features while mitigating information degradation. Finally, we implement a joint learning strategy for the fused features that resolves modal heterogeneity and facilitates joint mapping of visual attributes to semantic labels. Extensive experiments on multiple datasets validate the superiority of our approach.
