{"title":"Global–local co-regularization network for facial action unit detection","authors":"Yumei Tan , Haiying Xia , Shuxiang Song","doi":"10.1016/j.jvcir.2026.104728","DOIUrl":"10.1016/j.jvcir.2026.104728","url":null,"abstract":"<div><div>Facial action unit (AU) detection poses challenges in capturing discriminative local features and intricate AU correlations. To solve this challenge, we propose an effective Global–local Co-regularization Network (Co-GLN) trained in a collaborative manner. Co-GLN consists a global branch and a local branch, aiming to establish global feature-level interrelationships in the global branch while excavating region-level discriminative features in the local branch. Specifically, in the global branch, a Global Interaction (GI) module is designed to enhance cross-pixel relations for capturing global semantic information. The local branch comprises three components: the Region Localization (RL) module, the Intra-feature Relation Modeling (IRM) module, and the Region Interaction (RI) module. The RL module extracts regional features according to the pre-defined facial regions, then IRM module extracts local features for each region. Subsequently, the RI module integrates complementary information across regions. Finally, a co-regularization constraint is used to encourage consistency between the global and local branches. Experimental results demonstrate that Co-GLN consistently enhances AU detection performance on the BP4D and DISFA datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104728"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Infrared small UAV target detection via depthwise separable residual dense attention network","authors":"Keyang Cheng , Nan Chen , Chang Liu , Yue Yu , Hao Zhou , Zhe Wang , Changsheng Peng","doi":"10.1016/j.jvcir.2025.104703","DOIUrl":"10.1016/j.jvcir.2025.104703","url":null,"abstract":"<div><div>Unmanned aerial vehicles (UAVs) are extensively utilized in both military and civilian sectors, offering benefits and posing challenges. Traditional infrared small target detection techniques often suffer from high false alarm rates and low accuracy. To overcome these issues, we propose the Depthwise Separable Residual Dense Attention Network (DSRDANet), which redefines the detection task as a residual image prediction problem. This approach features an Adaptive Adjustment Segmentation Module (AASM) that uses depthwise separable residual dense blocks to extract detailed hierarchical features during encoding. Additionally, multi-scale feature fusion blocks are included to thoroughly aggregate multi-scale features and enhance residual image reconstruction during decoding. Furthermore, the Channel Attention Modulation Module (CAMM) is designed to model channel interdependencies and spatial encoding, optimizing the outputs from AASM by adjusting feature importance distribution across channels, ensuring comprehensive target attention. Experimental results on datasets for infrared small UAV target detection and tracking in various backgrounds validate our approach. Compared to state-of-the-art methods, our technique significantly enhances performance, improving the average F1 score by nearly 0.1, the IOU by 0.12, and the CG by 0.66.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104703"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFF-Matcher: Robust cross-source registration with density-fused feature and bidirectional consensus matching","authors":"Rong Guo , Zhenxuan Zeng , Jiang Wu , Xiyu Zhang , Siwen Quan , Zhongwen Hu , Yu Zhu , Jiaqi Yang","doi":"10.1016/j.jvcir.2026.104746","DOIUrl":"10.1016/j.jvcir.2026.104746","url":null,"abstract":"<div><div>Cross-source point cloud registration plays a pivotal role in enabling seamless 3D perception across heterogeneous sensors. However, this task remains highly challenging due to significant density variations, sensor-specific noise, and partial overlaps between heterogeneous sensors. To address these challenges, we propose DFF-Matcher, a robust framework that integrates density-robust feature learning and bidirectional consensus matching to bridge domain gaps across different sensors. Our approach introduces a density-fused feature module to handle significant point density variations and a self-attention enhanced matching strategy to ensure reliable correspondence estimation. This unified framework establishes a new paradigm for cross-source registration, achieving superior performance across diverse sensor modalities. Extensive experiments demonstrate significant improvements, including 25.4% higher feature matching recall and 22.2% greater registration recall on challenging Kinect-LiDAR datasets, while maintaining robust performance in both indoor and outdoor scenarios.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104746"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147398302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unified global–local feature modeling via reverse patch scaling for image manipulation localization","authors":"Jingying Cai , Hang Cheng , Jiabin Chen , Haichou Wang , Meiqing Wang","doi":"10.1016/j.jvcir.2026.104731","DOIUrl":"10.1016/j.jvcir.2026.104731","url":null,"abstract":"<div><div>Image manipulation localization requires comprehensive extraction and integration of global and local features. However, existing methods often adopt parallel architectures that process semantic context and local details separately, leading to limited interaction and fragmented representations. Moreover, applying uniform patching strategies across all layers ignores the varying semantic roles and spatial properties of deep features. To address these issues, we propose a unified framework that derives local representations directly from hierarchical global features. A reverse patch scaling strategy assigns smaller patch sizes and larger overlaps to deeper layers, enabling dense local modeling aligned with increasing semantic abstraction. An asymmetric cross-attention module improves feature interaction and consistency. Additionally, a dual-strategy decoder fuses multi-scale features via concatenation and addition, while a statistically guided edge awareness module models local variance and entropy from the predicted mask to refine boundary perception. Extensive experiments show that our method outperforms state-of-the-art approaches in both accuracy and robustness.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104731"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ShoeMatch3D: Attention-Enhanced deep learning framework for high-precision 3D shoeprint comparison","authors":"Binrui Li , Zhihan Tian , Linyu Huang , Yong Guo","doi":"10.1016/j.jvcir.2026.104730","DOIUrl":"10.1016/j.jvcir.2026.104730","url":null,"abstract":"<div><div>Shoeprint analysis plays a vital role in forensic investigations, especially in linking impressions to suspect footwear. Structured-light 3D scanning enables high-resolution capture of shoeprint point clouds, preserving geometric and depth details. However, traditional geometry-based methods often struggle with limited feature representation and noise sensitivity. To address this, we propose ShoeMatch3D, a deep learning framework for fine-grained 3D shoeprint comparison. The core network, CA-PointShoeNet, enhances PointNet++ with channel attention to better extract discriminative features. A cosine similarity-based triplet loss further optimizes the embedding space for robust matching. Experiments on a self-collected dataset demonstrate strong performance, with accuracies of 95.50%, 93.21%, and 90.90% on training, testing, and validation sets, respectively. These results confirm the method’s effectiveness and its potential for broader 3D forensic identification tasks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104730"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MTPA: A multi-aspects perception assisted AIGV quality assessment model","authors":"Yun Liu, Daoxin Fan, Zihan Liu, Sifan Li, Haiyuan Wang","doi":"10.1016/j.jvcir.2026.104721","DOIUrl":"10.1016/j.jvcir.2026.104721","url":null,"abstract":"<div><div>With the development of Artificial Intelligence (AI) generated technology, AI generated video (AIGV) has aroused much attention. Compared to the visual perceptual in traditional video, AIGV has its unique challenges, such as visual consistency, text-to-video alignment, etc. In this paper, we propose a multi-aspect perception assisted AIGV quality assessment model, which gives a comprehensive quality evaluation of AIGV from three aspects: text–video alignment score, visual spatial perceptual score, and visual temporal perceptual score. Specifically, a pre-trained vision-language module is adopted to study the text-to-video alignment quality, and the semantic-aware module is applied to capture the visual spatial perceptual features. Besides, an effective visual temporal feature extraction module is used to capture multi-scale temporal features. Finally, text–video alignment features, visual spatial, visual temporal perceptual features, and multi-scale visual fusion features are integrated to give a comprehensive quality evaluation. Our model holds state-of-the-art results on three public AIGV datasets, proving its effectiveness.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104721"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal prompt-guided vision transformer for precise image manipulation localization","authors":"Yafang Xiao , Wei Jiang , Shihua Zhou , Bin Wang , Pengfei Wang , Pan Zheng","doi":"10.1016/j.jvcir.2026.104736","DOIUrl":"10.1016/j.jvcir.2026.104736","url":null,"abstract":"<div><div>With the rise of generative AI and advanced image editing technologies, image manipulation localization has become more challenging. Existing methods often struggle with limited semantic understanding and insufficient spatial detail capture, especially in complex scenarios. To address these issues, we propose a novel multimodal text-guided framework for image manipulation localization. By fusing textual prompts with image features, our approach enhances the model’s ability to identify manipulated regions. We introduce a Multimodal Interaction Prompt Module (MIPM) that uses cross-modal attention mechanisms to align visual and textual information. Guided by multimodal prompts, our Vision Transformer-based model accurately localizes forged areas in images. Extensive experiments on public datasets, including CASIAv1 and Columbia, show that our method outperforms existing approaches. Specifically, on the CASIAv1 dataset, our approach achieves an F1 score of 0.734, surpassing the second-best method by 1.3%. These results demonstrate the effectiveness of our multimodal fusion strategy. The code is available at <span><span>https://github.com/Makabaka613/MPG-ViT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104736"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing temporal action localization through cross-modal and cross-structural knowledge distillation","authors":"Yue Yu, Cheng Wang, Yuxin Shi","doi":"10.1016/j.jvcir.2026.104734","DOIUrl":"10.1016/j.jvcir.2026.104734","url":null,"abstract":"<div><div>This paper proposes Cross-Modal and Cross-Structure distillation for rgb-based temporal action detection(C2MS-Net), a novel fully supervised approach for enhancing temporal action localization by leveraging cross-modal and cross-structural distillation techniques. By integrating information from multiple modalities and structural representations, C2MS-Net significantly improves the discriminative power of action proposals. A distillation framework is introduced, which transfers knowledge from a teacher model trained on rich multi-modal data to a more efficient student model. This approach not only enhances temporal localization accuracy but also improves the robustness of action detection against visual content variations. Extensive experiments on benchmark datasets demonstrate that the proposed C2MS-Net performs competitively with or surpasses state-of-the-art methods, particularly at lower and mid-range tIoU thresholds, while offering substantial gains in computational efficiency. By eliminating the need for optical flow extraction, the proposed method substantially reduces computational complexity, achieving faster inference speeds and smaller model sizes without compromising accuracy. Code, dataset and models are available at: <span><span>https://github.com/wangcheng666/ActionFormer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104734"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Progressively multi-scale feature fusion for semantic segmentation","authors":"Guoqing Zhang , Shichao Kan , Yigang Cen , Yi Cen , Qi Cao , Yansen Huang , Ming Zeng","doi":"10.1016/j.jvcir.2026.104739","DOIUrl":"10.1016/j.jvcir.2026.104739","url":null,"abstract":"<div><div>A fundamental challenge in semantic segmentation is the discriminative learning of pixel-level features. Various semantic segmentation methods and decoders in the literature have been reported to address this challenge. These methods involve directly upsampling feature maps of different sizes and then concatenating them along the channel dimension to generate pixel-level features. However, direct upsampling of feature maps can result in the misalignment of information at the pixel level, leading to suboptimal performance. In this paper, we introduce a novel solution called the Progressive Multi-Scale Feature Fusion (PMSFF) decoder to overcome this issue. Specifically, we develop a lightweight feed-forward network and atrous convolution layer, that are combined as a fusion module to fuse feature maps from adjacent layers. This fusion module is applied to different segments of a network to aggregate all feature maps for semantic segmentation. The fusion module is characterized by a simple and convenient structure with fewer parameters, which can be flexibly embedded into both Convolutional Neural Networks (CNNs) and Transformers to achieve progressive multi-scale pixel-level feature fusion. Extensive experiments on benchmark datasets have been conducted. The results illustrate the effectiveness and efficiency of the proposed module.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104739"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MGLA-DSNet: Multi-head global-local attention-enabled dual-stream network for weakly supervised video anomaly detection","authors":"Rashmiranjan Nayak, Umesh Chandra Pati, Santos Kumar Das","doi":"10.1016/j.jvcir.2026.104744","DOIUrl":"10.1016/j.jvcir.2026.104744","url":null,"abstract":"<div><div>Video Anomaly Detection (VAD) is the process of identifying anomalous events by analyzing spatiotemporal patterns in video. Furthermore, VAD is a complex task due to difficulties in obtaining frame-level annotations, data imbalance issues, and the equivocal and context-dependent nature of video anomalies. To address these issues, this article presents a weakly supervised learning-based Multi-head Global-Local Attention-enabled Dual-Stream Network (MGLA-DSNet) that effectively utilizes spatial (appearance) and temporal (motion) features, with an emphasis on context dependency. The proposed model uses two streams to extract RGB and optical flow features corresponding to appearance (spatial) and motion (temporal) properties, respectively. Subsequently, multi-head global and location attention with adaptive gating and head-wise specialization is applied to the concatenated RGB and Flow features to efficiently model global and local contexts, respectively, using multiple instance learning Finally, the proposed MGLA-DSNet model outperforms state-of-the-art methods across three benchmark datasets, including CUHK Avenue, ShanghaiTech Campus, and UCF-Crime.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104744"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}