{"title":"Information gap based knowledge distillation for occluded facial expression recognition","authors":"Yan Zhang , Zenghui Li , Duo Shen , Ke Wang , Jia Li , Chenxing Xia","doi":"10.1016/j.imavis.2024.105365","DOIUrl":"10.1016/j.imavis.2024.105365","url":null,"abstract":"<div><div>Facial Expression Recognition (FER) with occlusion presents a challenging task in computer vision because facial occlusions result in poor visual data features. Recently, the region attention technique has been introduced to address this problem by researchers, which make the model perceive occluded regions of the face and prioritize the most discriminative non-occluded regions. However, in real-world scenarios, facial images are influenced by various factors, including hair, masks and sunglasses, making it difficult to extract high-quality features from these occluded facial images. This inevitably limits the effectiveness of attention mechanisms. In this paper, we observe a correlation in facial emotion features from the same image, both with and without occlusion. This correlation contributes to addressing the issue of facial occlusions. To this end, we propose a Information Gap based Knowledge Distillation (IGKD) to explore the latent relationship. Specifically, our approach involves feeding non-occluded and masked images into separate teacher and student networks. Due to the incomplete emotion information in the masked images, there exists an information gap between the teacher and student networks. During training, we aim to minimize this gap to enable the student network to learn this relationship. To enhance the teacher’s guidance, we introduce a joint learning strategy where the teacher conducts knowledge distillation on the student during the training of the teacher. Additionally, we introduce two novel constraints, called knowledge learn and knowledge feedback loss, to supervise and optimize both the teacher and student networks. The reported experimental results show that IGKD outperforms other algorithms on four benchmark datasets. Specifically, our IGKD achieves 87.57% on Occlusion-RAF-DB, 87.33% on Occlusion-FERPlus, 64.86% on Occlusion-AffectNet, and 73.25% on FED-RO, clearly demonstrating its effectiveness and robustness. Source code is released at: .</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105365"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MD-Mamba: Feature extractor on 3D representation with multi-view depth","authors":"Qihui Li , Zongtan Li , Lianfang Tian , Qiliang Du , Guoyu Lu","doi":"10.1016/j.imavis.2024.105396","DOIUrl":"10.1016/j.imavis.2024.105396","url":null,"abstract":"<div><div>3D sensors provide rich depth information and are widely used across various fields, making 3D vision a hot topic of research. Point cloud data, as a crucial type of 3D data, offers precise three-dimensional coordinate information and is extensively utilized in numerous domains, especially in robotics. However, the unordered and unstructured nature of point cloud data poses a significant challenge for feature extraction. Traditional methods have relied on designing complex local feature extractors to achieve feature extraction, but these approaches have reached a performance bottleneck. To address these challenges, this paper introduces MD-Mamba, a novel network that enhances point cloud feature extraction by integrating multi-view depth maps. Our approach leverages multi-modal learning, treating the multi-view depth maps as an additional global feature modality. By fusing these with locally extracted point cloud features, we achieve richer and more distinctive representations. We utilize an innovative feature extraction strategy, performing real projections of point clouds and treating multi-view projections as video streams. This method captures dynamic features across viewpoints using a specially designed Mamba network. Additionally, the incorporation of the Siamese Cluster module optimizes feature spacing, improving class differentiation. Extensive evaluations on ModelNet40, ShapeNetPart, and ScanObjectNN datasets validate the effectiveness of MD-Mamba, setting a new benchmark for multi-modal feature extraction in point cloud analysis.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105396"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight and efficient feature fusion real-time semantic segmentation network","authors":"Jie Zhong, Aiguo Chen, Yizhang Jiang, Chengcheng Sun, Yuheng Peng","doi":"10.1016/j.imavis.2024.105408","DOIUrl":"10.1016/j.imavis.2024.105408","url":null,"abstract":"<div><div>The increasing demand for real-time performance in semantic segmentation for the field of autonomous driving has prompted a significant focus on the trade-off between speed and accuracy. Recently, many real-time semantic segmentation networks have opted for lightweight classification networks as their backbone. However, their lack of specificity for real-time semantic segmentation tasks compromises their ability to extract advanced semantic information effectively. This paper introduces the LAFFNet, a lightweight and efficient feature-fusion real-time semantic segmentation network. We devised a novel lightweight feature extraction block (LEB) to construct the encoder part, employing a combination of deep convolution and dilated convolution to extract local and global semantic features with minimal parameters, thereby enhancing feature map characterization. Additionally, we propose a coarse feature extractor block (CFEB) to recover lost local details during encoding and improve connectivity between encoding and decoding parts. In the decoding phase, we introduce the bilateral feature fusion block (BFFB), leveraging features from different inference stages to enhance the model’s ability to capture multi-scale features and conduct efficient feature fusion operations. Without pre-training, LAFFNet achieves a processing speed of 63.7 FPS on high-resolution (1024 × 2048) images from the Cityscapes dataset, with an mIoU of 77.06%. On the Camvid dataset, the model performs equally well, reaching 107.4 FPS with an mIoU of 68.29%. Notably, the model contains only 0.96 million parameters, demonstrating its exceptional efficiency in lightweight network design. These results demonstrate that LAFFNet achieves an optimal balance between accuracy and speed, providing an effective and precise solution for real-time semantic segmentation tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105408"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grad-CAM based explanations for multiocular disease detection using Xception net","authors":"M. Raveenthini , R. Lavanya , Raul Benitez","doi":"10.1016/j.imavis.2025.105419","DOIUrl":"10.1016/j.imavis.2025.105419","url":null,"abstract":"<div><div>Age-related macular degeneration (AMD), cataract, diabetic retinopathy (DR) and glaucoma are the four most common ocular conditions that lead to vision loss. Early detection in asymptomatic stages is necessary to alleviate vision loss. Manual diagnosis is costly, tedious, laborious and burdensome; assistive tools such as computer aided diagnosis (CAD) systems can help to alleviate these issues. Existing CAD systems for ocular diseases primarily address a single disease condition, employing disease-specific algorithms that rely on anatomical and morphological characteristics for localization of regions of interest (ROIs). The dependence on exhaustive image processing algorithms for pre-processing, ROI detection and feature extraction often results in overly complex systems prone to errors that affect classifier performance. Conglomerating many such individual diagnostic frameworks, each targeting a single disease, is not a practical solution for detecting multiple ocular diseases, especially in mass screening. Alternatively, a single generic CAD framework modeled as a multiclass problem serves to be useful in such high throughput scenarios, significantly reducing cost, time and manpower. Nevertheless, ambiguities in the overlapping features of multiple classes representing different diseases should be effectively addressed. This paper proposes a segmentation-independent approach based on deep learning (DL) to realize a single framework for the detection of different ocular conditions. The proposed work alleviates the need for pixel-level operations and segmentation techniques specific to different ocular diseases, offering a solution that has an upper hand compared to conventional systems in terms of complexity and accuracy. Further, explainability is incorporated as a value-addition that assures trust and confidence in the model. The system involves automatic feature extraction from full fundus images using Xception, a pre-trained deep model. Xception utilizes depthwise separable convolutions to capture subtle patterns in fundus images, effectively addressing the similarities between clinical indicators, such as drusen in AMD and exudates in DR, which often lead to misdiagnosis. A random over-sampling technique is performed to address class imbalance by equalizing the number of training samples across the classes. These features are fed to extreme gradient boosting (XGB) for classification. This study further aims to unveil the “black box” paradigm of model classification, by leveraging gradient-weighted class activation mapping (Grad-CAM) technique to highlight relevant ROIs. The combination of Xception based feature extraction and XGB classification results in 99.31% accuracy, 99.5% sensitivity, 99.8% specificity, 99.4% F1-score and 99.4% precision. 
The proposed system can be a promising tool aiding conventional manual screening in primary health care centres and mass screening scenarios for efficiently diagnosing multiple ocular dise","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105419"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
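A condensed sketch of the described pipeline (pooled Xception features, random over-sampling to balance classes, and an XGBoost classifier) might look as follows. Dataset loading, hyper-parameters and the Grad-CAM visualization step are omitted, and the array names are hypothetical.

```python
# Sketch of the feature-extraction + classification stages; not the authors'
# code. Grad-CAM would additionally be applied to the Xception backbone.
import numpy as np
import tensorflow as tf
from imblearn.over_sampling import RandomOverSampler
from xgboost import XGBClassifier

# Pre-trained Xception as a fixed feature extractor (global average pooling).
backbone = tf.keras.applications.Xception(weights="imagenet",
                                          include_top=False, pooling="avg")

def extract_features(images):
    """images: (N, 299, 299, 3) fundus images with pixel values in [0, 255]."""
    x = tf.keras.applications.xception.preprocess_input(images.astype("float32"))
    return backbone.predict(x, verbose=0)               # (N, 2048)

# Hypothetical arrays: X_img holds fundus images, y holds the class labels
# (e.g. normal, AMD, cataract, DR, glaucoma).
# feats = extract_features(X_img)
# feats_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(feats, y)
# clf = XGBClassifier(n_estimators=300, learning_rate=0.1)
# clf.fit(feats_bal, y_bal)
# preds = clf.predict(extract_features(X_test_img))
```
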
{"title":"Enriching visual feature representations for vision–language tasks using spectral transforms","authors":"Oscar Ondeng, Heywood Ouma, Peter Akuon","doi":"10.1016/j.imavis.2024.105390","DOIUrl":"10.1016/j.imavis.2024.105390","url":null,"abstract":"<div><div>This paper presents a novel approach to enrich visual feature representations for vision–language tasks, such as image classification and captioning, by incorporating spectral transforms. Although spectral transforms have been widely utilized in signal processing, their application in deep learning has been relatively under-explored. We conducted extensive experiments on various transforms, including the Discrete Fourier Transform (DFT), Discrete Cosine Transform, Discrete Hartley Transform, and Hadamard Transform. Our findings highlight the effectiveness of the DFT, mainly when using the magnitude of complex outputs, in enriching visual features. The proposed method, validated on the MS COCO and Kylberg datasets, demonstrates superior performance compared to previous models, with a 4.8% improvement in CIDEr scores for image captioning tasks. Additionally, our approach enhances caption diversity by up to 3.1% and improves generation speed by up to 2% in Transformer models. These results underscore the potential of spectral feature enrichment in advancing vision–language tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105390"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic-aware for point cloud domain adaptation with self-distillation learning","authors":"Jiming Yang, Feipeng Da, Ru Hong","doi":"10.1016/j.imavis.2025.105430","DOIUrl":"10.1016/j.imavis.2025.105430","url":null,"abstract":"<div><div>Unsupervised domain adaptation aims to apply knowledge gained from a label-rich domain, i.e., the source domain, to a label-scare domain, i.e., the target domain. However, direct alignment between the source and the target domains is challenging due to significant distribution differences. This paper introduces a novel unsupervised domain adaptation method for 3D point clouds. Specifically, to better learn the pattern of the target domain, we propose a self-distillation framework that effectively learns feature representations in a large-scale unlabeled target domain while enhancing resilience to noise and variations. Moreover, we propose Asymmetric Transferable Semantic Augmentation (AsymTSA) to bridge the gaps between theory and practical issues by extending the multivariate normal distribution assumption to multivariate skew-normal distribution, and progressively learning the semantic information in the target domain. Comprehensive experiments conducted on two benchmarks, PointDA-10, and GraspNetPC-10, and the results demonstrate the effectiveness and superiority of our method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105430"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correlation embedding semantic-enhanced hashing for multimedia retrieval","authors":"Yunfei Chen , Yitian Long , Zhan Yang , Jun Long","doi":"10.1016/j.imavis.2025.105421","DOIUrl":"10.1016/j.imavis.2025.105421","url":null,"abstract":"<div><div>Due to its feature extraction and information processing advantages, deep hashing has achieved significant success in multimedia retrieval. Currently, mainstream unsupervised multimedia hashing methods do not incorporate associative relationship information as part of the original features in generating hash codes, and their similarity measurements do not consider the transitivity of similarity. To address these challenges, we propose the Correlation Embedding Semantic-Enhanced Hashing (CESEH) for multimedia retrieval, which primarily consists of a semantic-enhanced similarity construction module and a correlation embedding hashing module. First, the semantic-enhanced similarity construction module generates a semantically enriched similarity matrix by thoroughly exploring similarity adjacency relationships and deep semantic associations within the original data. Next, the correlation embedding hashing module integrates semantic-enhanced similarity information with intra-modal semantic information, achieves precise correlation embedding and preserves semantic information integrity. Extensive experiments on three widely-used datasets demonstrate that the CESEH method outperforms state-of-the-art unsupervised hashing methods in both retrieval accuracy and robustness. The code is available at <span><span>https://github.com/YunfeiChenMY/CESEH</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105421"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143139145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel framework for diverse video generation from a single video using frame-conditioned denoising diffusion probabilistic model and ConvNeXt-V2","authors":"Ayushi Verma, Tapas Badal, Abhay Bansal","doi":"10.1016/j.imavis.2025.105422","DOIUrl":"10.1016/j.imavis.2025.105422","url":null,"abstract":"<div><div>The Denoising Diffusion Probabilistic Model (DDPM) has significantly advanced video generation and synthesis. DDPM relies on extensive datasets for its training process. The study presents a novel method for generating videos from a single video through a frame-conditioned Denoising Diffusion Probabilistic Model (DDPM). Additionally, incorporating the ConvNeXt-V2 model significantly boosts the framework’s feature extraction, improving video generation performance. Addressing the data scarcity challenge in video generation, the proposed model framework exploits a single video’s intrinsic complexities and temporal dynamics to generate diverse and realistic sequences. The model’s ability to generalize motion is demonstrated through thorough quantitative assessments, wherein it is trained on segments of the original video and evaluated on previously unseen frames. Integrating Global Response Normalization and Sigmoid-Weighted Linear Unit (SiLU) activation functions within the DDPM framework has enhanced generated video quality. Comparatively, the proposed model markedly outperforms the Sinfusion model across crucial image quality metrics, achieving a lower Freschet Video Distance (FVD) score of 106.52, lower Learned Perceptual Image Patch Similarity (LPIPS) score of 0.085, higher Structural Similarity Index Measure (SSIM) score of 0.089, higher Nearest-Neighbor-Field (NNF) based diversity (NNFDIV) score of 0.44. Furthermore, the model achieves a Peak Signal to Noise Ratio score of 23.95, demonstrating its strength in preserving image integrity despite noise. The integration of Global Response Normalization and SiLU significantly enhances content synthesis, while ConvNeXt-V2 boosts feature extraction, amplifying the model’s efficacy.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105422"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143139150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partitioned token fusion and pruning strategy for transformer tracking","authors":"Chi Zhang, Yun Gao, Tao Meng, Tao Wang","doi":"10.1016/j.imavis.2025.105431","DOIUrl":"10.1016/j.imavis.2025.105431","url":null,"abstract":"<div><div>Transformer-based tracking algorithms have shown outstanding performance in the field of object tracking due to their powerful global information capture capability. However, the redundant background information in the search region results in interference and high computational complexity in searching for the tracked object. To address this problem, we design a partitioned token fusion and pruning strategy for one-stream transformer trackers. The strategy can achieve a better balance between information retention and interference reduction, and it can improve tracking robustness while accelerating inference. Specifically, we partition search tokens into high-correlation, medium-correlation, and low-correlation based on their relevance to the object template. The feature information in the medium-correlation part is fused into the high-correlation part. Low-correlation tokens are directly discarded. Through the differentiated partitioned token fusion and pruning strategy, we not only reduce the number of tokens in the input network, thus reducing the high computational cost of the transformer, but also improve the robustness of tracking by retaining the useful information of the medium-relevant features while reducing the weight of the accompanying background noise information. The proposed strategy has been comprehensively evaluated experimentally in several challenging public benchmarks, and the results show that our approach achieves excellent overall performance compared with current state-of-the-art tracking methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105431"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increase the sensitivity of moderate examples for semantic image segmentation","authors":"Quan Tang , Fagui Liu , Dengke Zhang , Jun Jiang , Xuhao Tang , C.L. Philip Chen","doi":"10.1016/j.imavis.2024.105357","DOIUrl":"10.1016/j.imavis.2024.105357","url":null,"abstract":"<div><div>Dominant paradigms in modern semantic segmentation resort to the scheme of pixel-wise classification and do supervised training with the standard cross-entropy loss (CE). Although CE is intuitively straightforward and suitable for this task, it only cares about the predicted score of the target category and ignores the probability distribution information. We further notice that fitting hard examples, even if their number is small, results in model over-fitting in the test stage, as accumulated CE losses overwhelm the model during training. Besides, a large number of easy examples may also dazzle the model training. Based on this observation, this work presents a novel loss function we call Sensitive Loss (SL), which utilizes the overall predicted probability distribution information to down-weight the contribution of extremely hard examples (outliers) and easy examples (inliers) during training and rapidly focuses model learning on moderate examples. In this manner, SL encourages the model to learn potential feature generalities rather than diving into the details and noise implied by outliers to the extent. Thus, it is capable of alleviating over-fitting and improving generalization capacity. We also propose a dynamic Learning Rate Scaling (LRS) strategy to alleviate the decreasing gradient and improve the performance of SL. Extensive experiments evidence that our Sensitive Loss is superior to existing handcrafted loss functions and on par with searched losses, which generalize well to a wide range of datasets and algorithms. Specifically, training with the proposed SL brings a notable 1.7% mIoU improvement for the Mask2Former framework on Cityscapes dataset off the shelf.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105357"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}