Feature aware-contrastive learning network for arbitrary-sized image steganalysis
Yan Li, Zhaoyang Li, Jiao Liu, Yongfeng Dong, Jun Zhang
Journal of Visual Communication and Image Representation, Volume 111, Article 104525. Published 2025-07-11. DOI: 10.1016/j.jvcir.2025.104525

Image steganalysis aims to detect whether images contain secret information. In recent years, image steganalysis methods based on deep learning have exhibited remarkable performance in detecting fixed-size images. However, the performance of existing methods degrades when they are applied to images of arbitrary size. In this paper, we propose a Feature Aware-Contrastive Learning Network (FA-CLNet) for arbitrary-sized image steganalysis. FA-CLNet contains two significant modules: a residual focus-enhancement module (RFEM) and an adaptive prototype contrastive learning module (AP-CLM). The RFEM is an essential component of the feature extractor that suppresses the representation of irrelevant features and strengthens the features of the steganographic signal. Furthermore, the AP-CLM is designed to improve the discriminability between steganographic and general signals. It increases the feature difference between steganographic and regular signals through spatial clustering, while promoting the aggregation of features from different steganographic signals. To verify the effectiveness of the proposed algorithm, we conducted experiments with various datasets and steganography algorithms. The experimental results show that our method achieves promising results in arbitrary-sized image steganalysis. Our implementation is publicly available at https://github.com/FACLNet/FA-CLNet.
Degradation removal and detail restoration decomposition network for single image deraining
Jiyu Jin, Xuanyu Qi, Haobo Dong, Qiyuan Guan, Guiyue Jin, Lei Fan
Journal of Visual Communication and Image Representation, Volume 111, Article 104520. Published 2025-07-11. DOI: 10.1016/j.jvcir.2025.104520

Existing deraining methods primarily adopt an encoder–decoder architecture with uniform block settings to eliminate image degradation and reconstruct background details, without considering the functional requirements of different stages of the network. This practice can lead to a mismatch between requirements and model responses, resulting in serious performance bottlenecks. Based on this key insight, we propose a Degradation-Aware Removal Network (DAR-Net) for single image deraining, which structurally decouples the degradation removal of rainy images from the reconstruction of rain-free images. Specifically, we first use a Two-dimensional Bidirectional Long Short-Term Memory (BiLSTM2D) block to model the spatial scale and distribution of rain streaks. Simultaneously, we introduce a Degradation Match Removal Block (DMRB) matched to the specific function of the encoder stage, effectively eliminating the degradation. Furthermore, we design a Prompt Block (PB) in the decoder stage to complement the original underlying features and additional contextual information of the image. Extensive experiments on benchmark datasets demonstrate the effectiveness of our proposed method.
Character feature Alignment-based scene text spotter
Guang Han, Haiquan Huang, Zhuping Wang, Jiajun Sun, Jian Ye
Journal of Visual Communication and Image Representation, Volume 111, Article 104533. Published 2025-07-10. DOI: 10.1016/j.jvcir.2025.104533

In recent years, scene text spotting has received much attention because it allows joint training of scene text detection and recognition. However, existing methods not only struggle with precise localization of densely arranged text instances but also lack dynamic adjustment of receptive fields when handling deformed text, leading to insufficient focus on character-level features. This ultimately results in limited recognition performance and poor cross-scene generalization. In this paper, we propose a novel scene text spotter, called the Character Feature Alignment-based Scene Text Spotter (CFAS). CFAS uses a Swin Transformer to extract scene text image features and then detects text instances using an encoder architecture and a Balanced Interaction Module (BIM). To further improve the recognition features, a Character Alignment (CA) module is proposed to adaptively adjust the receptive field, and the text detection and recognition networks are co-optimized to improve spotting accuracy. This improves both the detection of complex or dense text and the model's ability to recognize text instances with varied character shapes. In addition, the model shows strong generalization and robustness, performing well on text detection and recognition in unseen noisy underwater scenes and occluded scenes. Experimental results on various datasets demonstrate the superiority of the proposed method.
{"title":"Similarity-aware generative adversarial network for facial expression image translation","authors":"Lin-Chieh Huang, Hung-Hsu Tsai","doi":"10.1016/j.jvcir.2025.104530","DOIUrl":"10.1016/j.jvcir.2025.104530","url":null,"abstract":"<div><div>This paper proposes an image translation framework for facial expression, which is called Similarity-aware Generative Adversarial Network (SimaGAN). It can encode an image to have style and content features representing class-related detail information and spatial structure, respectively. Moreover, similarity aggregation (SA) is developed for preserving content features to maintain the structure of the input image. Additionally, SimaGAN exploits SA in maximizing the similarity between a set of style features and its corresponding set of label embeddings to enhance the class-related information of the style features and meanwhile minimizing the relative similarity among false-negative style features to effectively learn the disentangle representation. Here, a co-occurrence discriminator is also developed in the design of SimaGAN to promote the image quality of the translated images due to getting the textures of the source image and preserving its detailed textures during the translation. Experimental results demonstrate that SimaGAN outperforms others existing methods consideration here.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104530"},"PeriodicalIF":2.6,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144655814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TrDPNet: A transformer-based diffusion model for single-image 3D point cloud reconstruction","authors":"Fei Li , Tiansong Li , Ke Xiao , Lin Wang , Li Yu","doi":"10.1016/j.jvcir.2025.104503","DOIUrl":"10.1016/j.jvcir.2025.104503","url":null,"abstract":"<div><div>The conditional diffusion model has shown great promise in 3D point cloud reconstruction from single-view image. Nevertheless, it is extremely challenging to effectively utilize the only image information to conditionally control the diffusion model to generate 3D point clouds. Previous methods heavily relied on projecting image information onto 3D point clouds and using PointNet to extract features from them. However, due to the locality of the projection method, PointNet may insufficiently fuse point clouds and image features. In this paper, we present TrDPNet, a novel Transformer-based diffusion model for single-image 3D point cloud reconstruction. TrDPNet integrates image features and point clouds for conditional control using the Transformer to achieve high-quality 3D reconstruction. Firstly, farthest point sampling is applied to identify key points, a sub-point cloud is established within the specified radius, and then the features are mapped to tokens in the high-dimensional space. Secondly, a series of cascaded Transformer blocks is utilized to fuse the image and point cloud information via attention mechanisms, conditionally guiding the diffusion model. This design not only integrates image information across the entire point cloud but also strengthens connections between point clouds. Finally, multi-layer perceptrons and linear interpolation restore the tokens to the original point cloud size, producing the final noisy prediction. The experimental results show that TrDPNet achieves over a 20% improvement on synthetic benchmarks compared to previous state-of-the-art methods. Our code and weights are available at <span><span>https://github.com/TLab512/TrDPNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104503"},"PeriodicalIF":2.6,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144632457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced monocular depth estimation using novel scale-invariant Error Structure Similarity Index measure optimization in Convolutional Neural network architecture
Emadoddin Hemmati, Sina Jarahizadeh, Amir Aghabalaei, Seyed Babak Haji Seyed Asadollah
Journal of Visual Communication and Image Representation, Volume 111, Article 104531. Published 2025-07-05. DOI: 10.1016/j.jvcir.2025.104531

Monocular Depth Estimation (MDE) is crucial for applications like autonomous driving, medical imaging, and 3D modeling. This paper presents a novel Convolutional Neural Network (CNN) architecture that balances performance and computational cost in MDE tasks. Key components include bottleneck mechanisms, a Modified Convolutional Block Attention Module (MCBAM), Atrous Spatial Pyramid Pooling (ASPP), and Pyramid Scene Parsing (PSP). Leveraging pre-trained backbones and attention mechanisms, our model significantly improves depth estimation accuracy and reduces computational complexity. Validated using the NYU Depth Dataset V2, our model outperforms existing benchmarks in Absolute Relative Error (Abs Rel), Square Relative Error (Sq Rel), Root Mean Square Error (RMSE), and thresholding metrics. A novel loss function incorporating the Structure Similarity Index Measure (SSIM) and Scale-Invariant Error (SIE) enhances training and evaluation. Our study advances MDE techniques, offering a practical solution with wide-ranging applications. Future research will explore attention mechanisms, fusion approaches, and real-time optimization for greater versatility.
Lightweight three-stream encoder–decoder network for multi-modal salient object detection
Junzhe Lu, Tingyu Wang, Bin Wan, Qiang Zhao, Shuai Wang, Yaoqi Sun, Yang Zhou, Chenggang Yan
Journal of Visual Communication and Image Representation, Volume 111, Article 104523. Published 2025-07-04. DOI: 10.1016/j.jvcir.2025.104523

Salient object detection (SOD) techniques identify the most attractive objects in a scene. In recent years, multi-modal SOD has shown promising prospects. However, most existing multi-modal SOD models ignore model size and computational cost in pursuit of comprehensive cross-modality feature fusion. To enhance the feasibility of high-accuracy models in practical applications, we propose a Lightweight Three-stream Encoder–Decoder Network (TENet) for multi-modal salient object detection. Specifically, we design three decoders to explore the saliency clues embedded in different multi-modal features and leverage a hierarchical decoding structure to alleviate the negative effects of low-quality images. To reduce the differences among modalities, we propose a lightweight modal information-guided fusion (MIGF) module to enhance the correlation between the RGB-D and RGB-T modalities, laying the groundwork for triple-modal fusion. Furthermore, to exploit multi-scale information, we propose a semantic interaction (SI) module and a semantic feature enhancement (SFE) module to integrate the hierarchical information embedded in high- and low-level features. Extensive experiments on the VDT-2048 dataset show that TENet has a model size of 37 MB, runs at 38 FPS, and achieves accuracy comparable to 16 state-of-the-art multi-modal methods.
{"title":"CA-VAD: Caption Aware Video Anomaly Detection in surveillance videos","authors":"Debi Prasad Senapati, Santosh Kumar Pani, Santos Kumar Baliarsingh, Prabhu Prasad Dev, Hrudaya Kumar Tripathy","doi":"10.1016/j.jvcir.2025.104521","DOIUrl":"10.1016/j.jvcir.2025.104521","url":null,"abstract":"<div><div>In video anomaly detection, identifying abnormal events using weakly supervised video-level labels is often tackled with multiple instance learning (MIL). However, traditional methods struggle to capture temporal relationships between segments and extract discriminative features for distinguishing normal from anomalous events. To address these challenges, we propose Caption Aware Video Anomaly Detection (CA-VAD), a framework that integrates visual and textual features for enhanced semantic understanding of scenes. Unlike conventional approaches relying solely on visual data, CA-VAD uses a pre-trained video captioning model to generate textual descriptions, transforming them into semantic embeddings that enrich visual features. These textual cues improve the differentiation between normal and abnormal events. CA-VAD incorporates an Attention-based Multi-Scale Temporal Network (A-MTN) to process visual and textual inputs, capturing temporal dynamics effectively. Experiments on CUHK Avenue, ShanghaiTech, UCSD Ped2, and XD-Violence datasets show that CA-VAD outperforms state-of-the-art methods, achieving superior accuracy and robustness.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104521"},"PeriodicalIF":2.6,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144535831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SFEN: Salient feature enhancement network for salient object detection
Yafei Du, Shengbing Che, Yangzhuo Tuo, Wenxin Liu, Wanqin Wang, Zixuan Zhang
Journal of Visual Communication and Image Representation, Volume 111, Article 104522. Published 2025-07-01. DOI: 10.1016/j.jvcir.2025.104522

Salient object detection methods based on multi-scale feature fusion tend to treat all features equally or apply the same fusion modules repeatedly, both of which weaken contextual relevance and can lead to the loss of the main regions of the target. We propose the salient feature enhancement network (SFEN), which incorporates several key innovations. First, we design an edge feature enhancement (EFE) module that focuses on enhancing edge pixels in shallow features to avoid the performance degradation caused by excessive processing. Second, we propose the salient object attention (SOA) module, which efficiently fuses adjacent features at different scales, minimizing the risk of losing the main regions of the target. Finally, the details multi-scale fusion (DMF) module refines local details and generates prediction maps; it also reduces the distributional differences between the encoder and decoder outputs while improving overall accuracy. We evaluated SFEN on six commonly used salient object detection datasets and compared it with advanced methods. Our model achieves an average F-measure of 0.914 on the DUTS dataset, and F-measures of 0.930 and 0.939 on the HKU-IS and ECSSD datasets, respectively.
{"title":"Self2Channel: Self-supervised denoising of different regions using coalition game based channel mask","authors":"Bolin Song, Yuanyuan Si, Ke Li","doi":"10.1016/j.jvcir.2025.104518","DOIUrl":"10.1016/j.jvcir.2025.104518","url":null,"abstract":"<div><div>Denoising approaches using only a single noisy image combined with self-supervised learning of blind spot networks have attracted much attention. However, most existing blind spot denoising strategies use random masking techniques, leading to the loss of complex details in the denoised images. In this paper, we propose a novel technique to generate coalition game-theoretic guided masks that perform non-uniform sampling in different channels to mitigate the loss of complex details and thus improve the denoising quality. Additionally, we introduce a framework called Self2Channel, which combines channel loss with the residual loss to enhance the accuracy of detail location selection, prioritizing the key but easily lost details that need to be preserved. Finally, our framework converges to the sum of the supervision loss and the noise variance while adhering to the expectation property of the measurement space. Extensive experiments validate the superiority of our proposed Self2Channel strategy over state-of-the-art approaches.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104518"},"PeriodicalIF":2.6,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144535830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}