{"title":"A temporally-aware noise-informed invertible network for progressive video denoising","authors":"Yan Huang , Huixin Luo , Yong Xu , Xian-Bing Meng","doi":"10.1016/j.imavis.2024.105369","DOIUrl":"10.1016/j.imavis.2024.105369","url":null,"abstract":"<div><div>Video denoising is a critical task in computer vision, aiming to enhance video quality by removing noise from consecutive video frames. Despite significant progress, existing video denoising methods still suffer from challenges in maintaining temporal consistency and adapting to different noise levels. To address these issues, a temporally-aware and noise-informed invertible network is proposed by following divide-and-conquer principle for progressive video denoising. Specifically, a recurrent attention-based reversible network is designed to distinctly extract temporal information from consecutive frames, thus tackling the learning problem of temporal consistency. Simultaneously, a noise-informed two-way dense block is developed by using estimated noise as conditional guidance to adapt to different noise levels. The noise-informed guidance can then be used to guide the learning of dense block for efficient video denoising. Under the framework of invertible network, the designed two parts can be further integrated to achieve invertible learning to enable progressive video denoising. Experiments and comparative studies demonstrate that our method can achieve good denoising accuracy and fast inference speed in both synthetic scenes and real-world applications.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105369"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing 3D object detection in autonomous vehicles based on synthetic virtual environment analysis","authors":"Vladislav Li , Ilias Siniosoglou , Thomai Karamitsou , Anastasios Lytos , Ioannis D. Moscholios , Sotirios K. Goudos , Jyoti S. Banerjee , Panagiotis Sarigiannidis , Vasileios Argyriou","doi":"10.1016/j.imavis.2024.105385","DOIUrl":"10.1016/j.imavis.2024.105385","url":null,"abstract":"<div><div>Autonomous Vehicles (AVs) rely on real-time processing of natural images and videos for scene understanding and safety assurance through proactive object detection. Traditional methods have primarily focused on 2D object detection, limiting their spatial understanding. This study introduces a novel approach by leveraging 3D object detection in conjunction with augmented reality (AR) ecosystems for enhanced real-time scene analysis. Our approach pioneers the integration of a synthetic dataset, designed to simulate various environmental, lighting, and spatiotemporal conditions, to train and evaluate an AI model capable of deducing 3D bounding boxes. This dataset, with its diverse weather conditions and varying camera settings, allows us to explore detection performance in highly challenging scenarios. The proposed method also significantly improves processing times while maintaining accuracy, offering competitive results in conditions previously considered difficult for object recognition. The combination of 3D detection within the AR framework and the use of synthetic data to tackle environmental complexity marks a notable contribution to the field of AV scene analysis.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105385"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial–temporal sequential network for anomaly detection based on long short-term magnitude representation","authors":"Zhongyue Wang, Ying Chen","doi":"10.1016/j.imavis.2024.105388","DOIUrl":"10.1016/j.imavis.2024.105388","url":null,"abstract":"<div><div>Notable advancements have been made in the field of video anomaly detection in recent years. The majority of existing methods approach the problem as a weakly-supervised classification problem based on multi-instance learning. However, the identification of key clips in this context is less precise due to a lack of effective connection between the spatial and temporal information in the video clips. The proposed solution to this issue is the Spatial-Temporal Sequential Network (STSN), which employs the Long Short-Term Magnitude Representation (LST-MR). The processing of spatial and temporal information is conducted in a sequential manner within a spatial–temporal sequential structure, with the objective of enhancing temporal localization performance through the utilization of spatial information. Furthermore, the long short-term magnitude representation is employed in spatial and temporal graphs to enhance the identification of key clips from both global and local perspectives. The combination of classification loss and distance loss is employed with magnitude guidance to reduce the omission of anomalous behaviors. The results on three widely used datasets: UCF-Crime, ShanghaiTech, and XD-Violence, demonstrate that the proposed method performs favorably when compared to existing methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105388"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-ensembling for 3D point cloud domain adaptation","authors":"Qing Li , Xiaojiang Peng , Chuan Yan , Pan Gao , Qi Hao","doi":"10.1016/j.imavis.2024.105409","DOIUrl":"10.1016/j.imavis.2024.105409","url":null,"abstract":"<div><div>Recently 3D point cloud learning has been a hot topic in computer vision and autonomous driving. Due to the fact that it is difficult to manually annotate a qualitative large-scale 3D point cloud dataset, unsupervised domain adaptation (UDA) is popular in 3D point cloud learning which aims to transfer the learned knowledge from the labeled source domain to the unlabeled target domain. Existing methods mainly resort to a deformation reconstruction in the target domain, leveraging the deformable invariance process for generalization and domain adaptation. In this paper, we propose a conceptually new yet simple method, termed as self-ensembling network (SEN) for domain generalization and adaptation. In SEN, we propose a soft classification loss on the source domain and a consistency loss on the target domain to stabilize the feature representations and to capture better invariance in the UDA task. In addition, we extend the pointmixup module on the target domain to increase the diversity of point clouds which further boosts cross domain generalization. Extensive experiments on several 3D point cloud UDA benchmarks show that our SEN outperforms the state-of-the-art methods on both classification and segmentation tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105409"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial–temporal-channel collaborative feature learning with transformers for infrared small target detection","authors":"Sicheng Zhu, Luping Ji, Shengjia Chen, Weiwei Duan","doi":"10.1016/j.imavis.2025.105435","DOIUrl":"10.1016/j.imavis.2025.105435","url":null,"abstract":"<div><div>Infrared small target detection holds significant importance for real-world applications, particularly in military applications. However, it encounters several notable challenges, such as limited target information. Due to the localized characteristic of Convolutional Neural Networks (CNNs), most methods based on CNNs are inefficient in extracting and preserving global information, potentially leading to the loss of detailed information. In this work, we propose a transformer-based method named Spatial-Temporal-Channel collaborative feature learning network (STC). Recognizing the difficulty in detecting small targets solely based on spatial information, we incorporate temporal and channel information into our approach. Unlike the Vision Transformer used in other vision tasks, our STC comprises three distinct transformer encoders that extract spatial, temporal and channel information respectively, to obtain more accurate representations. Subsequently, a transformer decoder is employed to fuse the three attention features in a way that akin to human vision system. Additionally, we propose a new Semantic-Aware positional encoding method for video clips that incorporate temporal information into positional encoding and is scale-invariant. Through the multiple experiments and comparisons with current methods, we demonstrate the effectiveness of STC in addressing the challenges of infrared small target detection. Our source codes are available at <span><span>https://github.com/UESTC-nnLab/STC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105435"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143139149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DALSCLIP: Domain aggregation via learning stronger domain-invariant features for CLIP","authors":"Yuewen Zhang , Jiuhang Wang , Hongying Tang , Ronghua Qin","doi":"10.1016/j.imavis.2024.105359","DOIUrl":"10.1016/j.imavis.2024.105359","url":null,"abstract":"<div><div>When the test data follows a different distribution from the training data, neural networks experience domain shift. We can address this issue with domain generalization (DG), which aims to develop models that can perform well on unknown domains. In this paper, we propose a simple yet effective framework called DALSCLIP to achieve high-performance generalization of CLIP, Contrastive LanguageImage Pre-training, in DG. Specifically, we optimize CLIP in two aspects: images and prompts. For images, we propose a method to remove domain-specific features from input images and learn better domain-invariant features. We first train specific classifiers for each domain to learn their corresponding domain-specific information and then learn a mapping to remove domain-specific information. For prompts, we design a lightweight optimizer(Attention-based MLP) to automatically optimize the prompts and incorporate domain-specific information into the input, helping the prompts better adapt to the domain. Meanwhile, we freeze the network parameters during training to maximize the retention of pre-training model information. We extensively evaluate our model on three public datasets. Qualitative and quantitative experiments demonstrate that our framework outperforms other baselines significantly.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105359"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FSBI: Deepfake detection with frequency enhanced self-blended images","authors":"Ahmed Abul Hasanaath , Hamzah Luqman , Raed Katib , Saeed Anwar","doi":"10.1016/j.imavis.2025.105418","DOIUrl":"10.1016/j.imavis.2025.105418","url":null,"abstract":"<div><div>Advances in deepfake research have led to the creation of almost perfect image manipulations that are undetectable to the human eye and some deepfake detection tools. Recently, several techniques have been proposed to differentiate deepfakes from real images and videos. This study introduces a frequency enhanced self-blended images (FSBI) approach for deepfake detection. This proposed approach utilizes discrete wavelet transforms (DWT) to extract discriminative features from self-blended images (SBI). The features are then used to train a convolutional network architecture model. SBIs blend the image with itself by introducing several forgery artifacts in a copy of the image before blending it. This prevents the classifier from overfitting specific artifacts by learning more generic representations. These blended images are then fed into the frequency feature extractor to detect artifacts that could not be detected easily in the time domain. The proposed approach was evaluated on FF++ and Celeb-DF datasets, and the obtained results outperformed state-of-the-art techniques using the cross-dataset evaluation protocol, achieving an AUC of 95.49% on Celeb-DF dataset. It also achieved competitive performance in the within-dataset evaluation setup. These results highlight the robustness and effectiveness of our method in addressing the challenging generalization problem inherent in deepfake detection. The code is available at <span><span>https://github.com/gufranSabri/FSBI</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105418"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ProtoMed: Prototypical networks with auxiliary regularization for few-shot medical image classification","authors":"Achraf Ouahab, Olfa Ben Ahmed","doi":"10.1016/j.imavis.2024.105337","DOIUrl":"10.1016/j.imavis.2024.105337","url":null,"abstract":"<div><div>Although deep learning has shown impressive results in computer vision, the scarcity of annotated medical images poses a significant challenge for its effective integration into Computer-Aided Diagnosis (CAD) systems. Few-Shot Learning (FSL) opens promising perspectives for image recognition in low-data scenarios. However, applying FSL for medical image diagnosis presents significant challenges, particularly in learning disease-specific and clinically relevant features from a limited number of images. In the medical domain, training samples from different classes often exhibit visual similarities. Consequently, certain medical conditions may present striking resemblances, resulting in minimal inter-class variation. In this paper, we propose a prototypical network-based approach for few-shot medical image classification for low-prevalence diseases detection. Our method leverages meta-learning to use prior knowledge gained from common diseases, enabling generalization to new cases with limited data. However, the episodic training inherent in meta-learning tends to disproportionately emphasize the connections between elements in the support set and those in the query set, which can compromise the understanding of complex relationships within medical image data during the training phase. To address this, we propose an auxiliary network as a regularizer in the meta-training phase, designed to enhance the similarity of image representations from the same class while enforcing dissimilarity between representations from different classes in both the query and support sets. The proposed method has been evaluated using three medical diagnosis problems with different imaging modalities and different levels of visual imaging details and patterns. The obtained model is lightweight and efficient, demonstrating superior performance in both efficiency and accuracy compared to state-of-the-art. These findings highlight the potential of our approach to improve performance in practical applications, balancing resource limitations with the need for high diagnostic accuracy.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105337"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HPD-Depth: High performance decoding network for self-supervised monocular depth estimation","authors":"Liehao Wu , Laihua Wang , Guanghui Wei, Yang Yu","doi":"10.1016/j.imavis.2024.105360","DOIUrl":"10.1016/j.imavis.2024.105360","url":null,"abstract":"<div><div>Self-supervised monocular depth estimation methods have shown promising results by leveraging geometric relationships among image sequences for network supervision. However, existing methods often face challenges such as blurry depth edges, high computational overhead, and information redundancy. This paper analyzes and investigates technologies related to deep feature encoding, decoding, and regression, and proposes a novel depth estimation network termed HPD-Depth, optimized by three strategies: utilizing the Residual Channel Attention Transition (RCAT) module to bridge the semantic gap between encoding and decoding features while highlighting important features; adopting the Sub-pixel Refinement Upsampling (SPRU) module to obtain high-resolution feature maps with detailed features; and introducing the Adaptive Hybrid Convolutional Attention (AHCA) module to address issues of local depth confusion and depth boundary blurriness. HPD-Depth excels at extracting clear scene structures and capturing detailed local information while maintaining an effective balance between accuracy and parameter count. Comprehensive experiments demonstrate that HPD-Depth performs comparably to state-of-the-art algorithms on the KITTI benchmarks and exhibits significant potential when trained with high-resolution data. Compared with the baseline model, the average relative error and squared relative error are reduced by 6.09% and 12.62% in low-resolution experiments, respectively, and by 11.3% and 18.5% in high-resolution experiments, respectively. Moreover, HPD-Depth demonstrates excellent generalization performance on the Make3D dataset.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105360"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AwareTrack: Object awareness for visual tracking via templates interaction","authors":"Hong Zhang , Jianbo Song , Hanyang Liu , Yang Han , Yifan Yang , Huimin Ma","doi":"10.1016/j.imavis.2024.105363","DOIUrl":"10.1016/j.imavis.2024.105363","url":null,"abstract":"<div><div>Current popular trackers, whether based on the Siamese network or Transformer, have focused their main work on relation modeling between the template and the search area, and on the design of the tracking head, neglecting the fundamental element of tracking, the template. Templates are often mixed with too much background information, which can interfere with the extraction of template features. To address the above issue, a template object-aware tracker (AwareTrack) is proposed. Through the information interaction between multiple templates, the attention of the templates can be truly focused on the object itself, and the background interference can be suppressed. To ensure that the foreground objects of the templates have the same appearance to the greatest extent, the concept of awareness templates is proposed, which consists of two close frames. In addition, an awareness templates sampling method based on similarity discrimination via Siamese network is also proposed, which adaptively determines the interval between two awareness templates, ensure the maximization of background differences in the awareness templates. Meanwhile, online updates to the awareness templates ensure that our tracker has access to the most recent features of the foreground object. Our AwareTrack achieves state-of-the-art performance on multiple benchmarks, particularly on the one-shot tracking benchmark GOT-10k, achieving the AO of 78.1%, which is a 4.4% improvement over OSTrack-384.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105363"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}