{"title":"Secure vision: Integrated anti-spoofing and deep-fake detection system using knowledge distillation approach","authors":"K Jayashree , S Chakaravarthi , J Samyuktha , J Savitha , M Chaarulatha , A Yogeswari , G Samyuktha","doi":"10.1016/j.image.2026.117481","DOIUrl":"10.1016/j.image.2026.117481","url":null,"abstract":"<div><div>Malicious users generate fake videos and images that spread misinformation and are used to harass and blackmail vulnerable people. A wide variety of techniques, including the combining, merging, replacement, and imposition of photos and video recordings, are used to construct deepfakes. Moreover, spoofed audio and calls are generated through deepfakes, which require specially trained models. Machine learning and deep learning are improving rapidly, and a variety of techniques and tools are employed in deepfake detection and anti-spoofing. Detecting both spoofing and deepfakes becomes possible by resolving existing issues such as limited generalizability, overfitting, and complexity. To overcome these challenges, a knowledge distillation model is introduced in this paper. The process begins with pre-processing using the weighted median filter (WmF), where weighting the intensities of neighboring pixels helps smooth out variations. After that, feature extraction is carried out by a Dual attention based dilated ResNeXT with Residual autoencoder (DAD-DRAE), which provides features with lower dimensionality. In the classification phase, an Optimized Multi-task Transformer induced Relational knowledge distillation model (OMT-RKD) is deployed to categorize the distinct classes of anti-spoofing and deepfake detection. The hyperparameters of the classification model are tuned by the Tent chaotic Hippo optimization algorithm (TCHOA), whose chaotic function accelerates convergence and reduces model parameter complexity. 
In the evaluation, the proposed model is trained with three datasets and achieved an accuracy of 98.68%, 98.22% and 98.44% in the Deepfake Detection Challenge (DFDC) dataset, ASVspoof dataset and FaceForensics++, respectively.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"143 ","pages":"Article 117481"},"PeriodicalIF":2.7,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
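The abstract above names a weighted median filter (WmF) for pre-processing but does not spell out its form. As a hypothetical illustration only (the function name, the 3×3 kernel, and the integer weights are assumptions, not taken from the paper), a conventional weighted median filter can be sketched as:

```python
import numpy as np

def weighted_median_filter(img, weights):
    """Weighted median of each pixel's neighborhood.

    Each neighbor's intensity is replicated according to its integer
    weight, and the median of the expanded list is taken. This smooths
    noise while preserving edges better than plain averaging.
    """
    h, w = img.shape
    k = weights.shape[0] // 2  # kernel radius (3x3 kernel -> 1)
    padded = np.pad(img, k, mode="edge")
    out = np.empty_like(img, dtype=float)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + 2 * k + 1, x:x + 2 * k + 1]
            # repeat each intensity by its (integer) weight, then take the median
            expanded = np.repeat(patch.ravel(), weights.ravel())
            out[y, x] = np.median(expanded)
    return out
```

With uniform weights this reduces to the ordinary median filter; non-uniform weights bias the output toward the most trusted neighbors.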
{"title":"Research on path planning utilizing landmark salient features and composition techniques","authors":"Cong Xiao, Gang Chen, Zhengwei Yao, Weijie Zhang, Zhaohui Huang, Qingshu Yuan","doi":"10.1016/j.image.2026.117505","DOIUrl":"10.1016/j.image.2026.117505","url":null,"abstract":"<div><div>In virtual reality, traditional path planning often focuses solely on the shortest route, neglecting the pursuit of scenic beauty and immersive experiences for visitors. This paper presents an aesthetic global path planning framework that integrates landmark saliency with photographic composition principles to generate visually pleasing roaming routes in virtual environments. Unlike previous approaches, which either rely solely on visual saliency maps or use handcrafted aesthetic rules for viewpoint selection, our method establishes a unified optimization process that couples perceptual significance and aesthetic spatial flow. Specifically, the framework introduces an Aesthetic A* algorithm that extends traditional A* pathfinding by incorporating a dual-objective cost function balancing geometric distance and aesthetic score. In addition, landmark-aware constraints ensure that the generated viewpoints align with salient scene elements, while a cubic Bezier-based path smoother refines local trajectories to preserve visual continuity. 
Experiments on various architectural and cultural scenes demonstrate that the proposed method not only improves the visual coherence and landmark coverage of roaming paths but also yields significantly higher subjective aesthetic ratings compared to existing saliency-based or rule-based approaches.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"143 ","pages":"Article 117505"},"PeriodicalIF":2.7,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146190559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
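The dual-objective cost described above (geometric distance balanced against an aesthetic score) can be illustrated with a toy grid version of such an "Aesthetic A*". The unit move cost, the `lam` weight, and the `(1 - score)` penalty below are illustrative assumptions, not the paper's actual formulation:

```python
import heapq
import itertools

def aesthetic_astar(aesthetic, start, goal, lam=0.5):
    """A* over a 4-connected unit-cost grid whose step cost blends
    geometric distance with an aesthetic penalty (1 - score)."""
    rows, cols = len(aesthetic), len(aesthetic[0])

    def h(n):  # admissible Manhattan-distance heuristic (step cost >= 1)
        return abs(n[0] - goal[0]) + abs(n[1] - goal[1])

    tie = itertools.count()  # breaks heap ties between equal-cost entries
    frontier = [(h(start), next(tie), 0.0, start)]
    parent = {start: None}
    best_g = {start: 0.0}
    while frontier:
        _, _, g, node = heapq.heappop(frontier)
        if node == goal:  # reconstruct the path from the parent links
            path = [node]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        y, x = node
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < rows and 0 <= nx < cols:
                # unit move cost plus the weighted aesthetic penalty
                ng = g + 1.0 + lam * (1.0 - aesthetic[ny][nx])
                if ng < best_g.get((ny, nx), float("inf")):
                    best_g[(ny, nx)] = ng
                    parent[(ny, nx)] = node
                    heapq.heappush(frontier, (ng + h((ny, nx)), next(tie), ng, (ny, nx)))
    return None
```

Raising `lam` steers the route through higher-scoring cells at the expense of length, which is the trade-off the dual-objective cost function encodes.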
{"title":"A cross self-attention feature fusion module for 2D multiple human pose estimation","authors":"Jin Zhan , Zhenmeng Yue , Weili Tian , Huimin Zhao , Guiyuan Xie , Bo Hu , Fangyuan Lei , Guozhu Liang","doi":"10.1016/j.image.2026.117507","DOIUrl":"10.1016/j.image.2026.117507","url":null,"abstract":"<div><div>Existing attention-based human pose estimation methods detect joints by modeling the global context of keypoints and their surrounding pixels. These methods often have high computational complexity and are prone to overlooking key semantic information about the pose. In this paper, we propose a novel serial-parallel cross attention fusion module to capture richer structural correlations among multiple human joints. Within this module, a dual-branch attention block is designed that compresses channel or spatial dimensional information while preserving high-resolution details, effectively reducing computational complexity and minimizing semantic loss. Furthermore, the proposed cross attention fusion strategy endows the dual-branch channel attention block and the original feature fusion operation with self-attention learning capabilities, effectively capturing the global dependency relationships among multiple body joints. We performed extensive experiments based on HRNet on the CrowdPose and MS COCO datasets. 
The experiments demonstrate that our method outperforms similar methods and exhibits superior robustness to occlusion, deformation, and other complex scenarios in pose estimation.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"143 ","pages":"Article 117507"},"PeriodicalIF":2.7,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146190558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer tracking with multi-scale extended attention","authors":"Yuanyun Wang, Pengcheng Sha, Jun Wang, Yan Xia","doi":"10.1016/j.image.2026.117503","DOIUrl":"10.1016/j.image.2026.117503","url":null,"abstract":"<div><div>Currently, popular trackers use a Transformer as the backbone because the Transformer can capture long-distance dependencies in sequence data, giving the model superior global modeling capabilities. However, a single multi-head self-attention structure does not fully utilize the interaction between feature maps of different stages and scales, which may limit performance in downstream tasks, and the attention patterns exhibit high similarity between different heads, leading to computational redundancy. In this paper, a Multi-Scale Extended Attention block is designed by using sliding extended windows with different expansion rates, which can capture contextual semantic dependencies at different scales. Small local and sparse information interactions can be implemented at each head to effectively capture multi-scale semantic information and reduce computational redundancy. Based on this block, an efficient Transformer-based feature extraction backbone is constructed. It uses multi-scale extended attention (MSEA) blocks in the shallow layers of the network and multi-head self-attention (MHSA) blocks in the deep layers. The backbone can achieve complementarity between local and global features, and effectively compensates for the shortcomings of the Transformer in feature extraction. The proposed tracker is trained end-to-end and tested on six tracking benchmarks, including UAV123, GOT-10k, LaSOT, TNL2K, TrackingNet and NfS, and achieves superior tracking performance on these benchmarks. 
In particular, it achieves an AUC of 68.1% on UAV123 and an AO of 72.4% on GOT-10k.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"143 ","pages":"Article 117503"},"PeriodicalIF":2.7,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146190539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Markerless emotion recognition from full-body movements for Social XR","authors":"Michael Neri , Sara Baldoni , Marco Carli , Federica Battisti","doi":"10.1016/j.image.2026.117489","DOIUrl":"10.1016/j.image.2026.117489","url":null,"abstract":"<div><div>In this work, an emotion recognition system for enhancing social XR applications is presented. Although several techniques for emotion recognition have been proposed in the literature, they either require invasive and advanced equipment or exploit facial expressions, speech excerpts, physiological data, and text. In this contribution, by contrast, an approach for markerless emotion classification through body language is designed. More specifically, human movements are analyzed over time by extracting the skeleton joints in videos acquired by consumer cameras. A normalization procedure has been introduced to provide a depth-independent skeleton representation without distorting the skeleton shape. The performance of the proposed method has been assessed using a dataset of videos recorded from multiple points of view. An ad-hoc learning-based emotion classifier has been trained to recognize four emotions (happiness, boredom, interest, and disgust), achieving an average accuracy of 72.5%. 
The pre-processed dataset, code, and demo with pre-trained models are available at <span><span>https://github.com/michaelneri/emotion-recognition-human-movements</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"143 ","pages":"Article 117489"},"PeriodicalIF":2.7,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146039661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
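The depth-independent skeleton normalization mentioned above is not specified in detail here; a minimal sketch of one common approach (root-centering plus division by a reference bone length — the joint indices and epsilon are placeholders, not the paper's procedure) might look like:

```python
import numpy as np

def normalize_skeleton(joints, root=0, ref_a=0, ref_b=1):
    """Center joints on a root joint and rescale by a reference bone
    length, yielding a representation independent of camera distance
    while preserving the skeleton's shape (uniform scaling only)."""
    centered = joints - joints[root]            # translation invariance
    scale = np.linalg.norm(joints[ref_a] - joints[ref_b])
    return centered / (scale + 1e-8)            # depth/scale invariance
```

Because the same scalar divides every joint, relative proportions (and hence posture) are unchanged, which matches the stated goal of normalizing without distorting the skeleton shape.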
{"title":"Global and interactive graph channel attention for robust stereo matching","authors":"Jun Yu , Xiaofeng Wang , Yingying Su , Zhiheng Sun , Jiameng Sun","doi":"10.1016/j.image.2026.117491","DOIUrl":"10.1016/j.image.2026.117491","url":null,"abstract":"<div><div>Current learning-based stereo matching methods are generally poor at adaptively exploring robust and salient features across different scenes, leading to matching ambiguity, especially in challenging areas. To tackle this problem, inspired by the global representation power of graphs, we propose a Graph Channel Attention (GCA) mechanism to globally and interactively learn binocular attention for robust stereo matching, instead of the traditional separate local monocular attention. We first construct a 2D binocular graph structure with left and right subgraphs, where the left and right channel information can interact globally. After that, our interactive graph inference with cross interaction and inner aggregation is proposed to improve the linkage inference between and within the binocular graphs, which can consider global and interactive attention information like real human eyes. Thus, our GCA extends channel attention from traditional 1D to binocular 2D, imitating the global interaction and attention ability of real human eyes. 
Finally, we integrate the GCA into stereo matching, and experimental results show that our method achieves state-of-the-art performance on KITTI 2012/2015 and the Middlebury Stereo Evaluation v.3.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"143 ","pages":"Article 117491"},"PeriodicalIF":2.7,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146039660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quality assessment of 3D reconstructed meshes: Bridging objective metrics, subjective perception, and behavioral cues","authors":"Anna Ferrarotti , Isabel Rodríguez , Javier Usón , Sara Baldoni , David Barbero , Daniel Berjón , Francisco Morán , Narciso García , Federica Battisti , Jesús Gutiérrez , Marco Carli , Julián Cabrera","doi":"10.1016/j.image.2026.117514","DOIUrl":"10.1016/j.image.2026.117514","url":null,"abstract":"<div><div>Assessing the quality of 3D reconstructed models remains a key challenge in multimedia applications, especially in the context of cultural heritage, where visual fidelity and perceptual realism are equally crucial. This study investigates how reconstruction parameters, as well as existing objective quality metrics, align with human perception. In addition, we analyze how perceived quality and user interaction are related. A dataset of 3D models was generated by varying the number of input images, mesh complexity, and texture resolution. Results from a subjective study show that texture resolution significantly affects perceived quality, whereas variations in number of images and mesh complexity have a limited impact. Furthermore, interaction behavior was found to vary with perceived quality, with participants spending more time and exploring larger viewing angles for models receiving higher scores. 
These findings highlight the need for perceptually grounded, interaction-aware evaluation methodologies and provide guidelines for future perceptual optimization of 3D reconstruction pipelines.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"143 ","pages":"Article 117514"},"PeriodicalIF":2.7,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146190541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAPLE: Combination of multiple angles of view and enhanced pseudo-label generation for unsupervised person re-identification","authors":"Mai T. Do , Anh D. Nguyen","doi":"10.1016/j.image.2025.117462","DOIUrl":"10.1016/j.image.2025.117462","url":null,"abstract":"<div><div>The primary objective of person ReID is to identify a specific individual among various surveillance cameras. Although early studies focused on supervised ReID using deep learning models, the application to real-world scenarios has highlighted challenges, such as large data volumes and increased manual labeling costs. This has led to a surge in interest in unsupervised ReID techniques, which utilize unlabeled data. Unsupervised person ReID methods can be classified into Unsupervised Domain Adaptation (UDA) ReID approaches and fully Unsupervised Learning (USL) ReID approaches. While UDA ReID leverages knowledge transfer from a source to a target domain, it can suffer from limitations due to domain discrepancies. In contrast, USL ReID relies solely on unlabeled datasets, offering flexibility and scalability but grappling with challenges related to feature representation and pseudo-labeling accuracy. This research introduces MAPLE to address these challenges. Our contributions include a novel strategy to integrate local region information with global features called Multi-Angles of View, an improved approach to unsupervised clustering using the DBSCAN method, and the integration of domain adaptation to bolster unsupervised learning. 
Extensive experiments on benchmarks such as Market-1501 and MSMT17 demonstrate our method’s superior performance compared to some state-of-the-art achievements, confirming its practical potential.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"142 ","pages":"Article 117462"},"PeriodicalIF":2.7,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
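The abstract above names DBSCAN as the basis of MAPLE's improved clustering; a minimal sketch of DBSCAN-based pseudo-label generation on L2-normalized features (the `eps`/`min_samples` values and the outlier-discarding policy are illustrative assumptions, not the paper's settings) could look like:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(features, eps=0.5, min_samples=4):
    """Cluster feature embeddings into pseudo-identities for
    unsupervised ReID training."""
    # L2-normalize so Euclidean distance tracks cosine similarity
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    # DBSCAN marks outliers with -1; those samples are typically
    # excluded from the next training round
    keep = labels != -1
    return labels, keep
```

Because DBSCAN does not require the number of identities in advance and leaves ambiguous samples unlabeled, it is a common choice for pseudo-labeling in unsupervised ReID pipelines.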
{"title":"Multi-stream interaction network with cross-modal contrast distillation for co-salient object detection","authors":"Wujie Zhou , Bingying Wang , Xiena Dong , Caie Xu , Fangfang Qiang","doi":"10.1016/j.image.2025.117454","DOIUrl":"10.1016/j.image.2025.117454","url":null,"abstract":"<div><div>Co-salient object detection is a challenging task. Despite advances in existing detectors, two problems remain unsolved. First, although depth maps complement spatial information, existing methods do not effectively fuse multimodal information, and multiscale features are not aggregated appropriately to predict co-salient maps. Second, existing deep-learning methods usually require large numbers of parameters; thus, model sizes must be reduced while ensuring accuracy to enable them to run on resource-constrained end devices. We propose a multi-stream interaction cooperative encoder that constructs early fusion branches to improve modal interactions, and a two-stage transformer decoder to promote multiscale feature fusion. Finally, a multi-stream interaction network with cross-modal contrast knowledge distillation is proposed to connect the student and teacher models, improving the performance of the student model while sustaining low computing requirements and achieving collaborative co-salient detection. Our solution is based on a teacher–student architecture that uses contrastive learning to transfer knowledge between deep networks while enhancing semantic consistency and suppressing noise. We employ cross-modal contrast distillation and attention modules in the encoding and decoding phases, respectively, to enhance channel response and spatial consistency. In addition, a collaborative contrast-learning module is employed to better convey structural knowledge, helping the student model obtain more accurate group semantic information. 
Experiments on benchmark datasets show the superior performance of the proposed multi-stream interaction network with cross-modal contrast knowledge distillation in collaborative saliency target detection.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"142 ","pages":"Article 117454"},"PeriodicalIF":2.7,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145842305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
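The paper's distillation is contrastive and cross-modal; as background only, the classic logit-level knowledge distillation that teacher-student transfer builds on (Hinton-style softened softmax with temperature T and the T² scaling — not the paper's actual loss) can be written as:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened, numerically stable softmax."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)
```

A higher temperature exposes the teacher's "dark knowledge" (the relative probabilities of wrong classes), which is what the student imitates beyond the hard labels.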
{"title":"Micro-expression recognition based on dataset balance and local connected bi-branch network","authors":"Hanpu Wang, Fuyuan Luo, Ju Zhou, Xinyu Liu, Haolin Xia, Tong Chen","doi":"10.1016/j.image.2026.117480","DOIUrl":"10.1016/j.image.2026.117480","url":null,"abstract":"<div><div>Micro-expressions are subtle and transient facial movements that reveal underlying human emotions, and they hold significant research and application value in fields such as public safety, criminal investigation, and clinical diagnosis. However, due to their fleeting duration and low intensity, existing micro-expression datasets are limited in size and suffer from severe class imbalance, which poses great challenges for reliable recognition. In this paper, we propose an assessment-based re-sampling (ASR) strategy to augment micro-expression data and alleviate category imbalance. Specifically, we first employ semi-supervised self-training on the original dataset to learn an assessment model with both high accuracy and high recall. This model is then used to evaluate frames in micro-expression video sequences (excluding those in the training set). The non-apex frames identified through this assessment are subsequently selected to directly expand the underrepresented classes. Furthermore, we design a locally connected bi-branch network (LCB) for micro-expression recognition. In this network, the high-frequency components of micro-expression frames are extracted to capture weak facial muscle movements and combined with global information as complementary input. We conduct extensive experiments on three benchmark datasets, CASME, CASME II, and SAMM. 
The results demonstrate that our method is both effective and competitive, achieving an accuracy of 90.23% on the SAMM dataset.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"142 ","pages":"Article 117480"},"PeriodicalIF":2.7,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
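The high-frequency extraction mentioned in the abstract above is not specified; a minimal frequency-domain sketch (the radial cutoff and FFT-based formulation are assumptions for illustration, not the paper's method) could be:

```python
import numpy as np

def high_frequency(frame, cutoff=0.1):
    """Suppress low spatial frequencies of a grayscale frame, keeping
    the high-frequency detail that carries subtle muscle motion."""
    f = np.fft.fftshift(np.fft.fft2(frame))   # move DC to the center
    h, w = frame.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = r > cutoff * min(h, w)             # radial high-pass mask
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
```

Uniform regions of the face map to low frequencies and are removed, so the residual image emphasizes fine edges and micro-movements, which can then be fed to the recognition branch alongside the global input.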