{"title":"Towards on-device continual learning with Binary Neural Networks in industrial scenarios","authors":"Lorenzo Vorabbi , Angelo Carraggi , Davide Maltoni , Guido Borghi , Stefano Santi","doi":"10.1016/j.imavis.2025.105524","DOIUrl":"10.1016/j.imavis.2025.105524","url":null,"abstract":"<div><div>This paper addresses the challenges of deploying deep learning models, specifically Binary Neural Networks (BNNs), on resource-constrained embedded devices within the Internet of Things context. As deep learning continues to gain traction in IoT applications, the need for efficient models that can learn continuously from incremental data streams without requiring extensive computational resources has become more pressing. We propose a solution that integrates Continual Learning with BNNs, utilizing replay memory to prevent catastrophic forgetting. Our method focuses on quantized neural networks, introducing the quantization also for the backpropagation step, significantly reducing memory and computational requirements. Furthermore, we enhance the replay memory mechanism by storing intermediate feature maps (<em>i.e.</em> latent replay) with 1-bit precision instead of raw data, enabling efficient memory usage. In addition to well-known benchmarks, we introduce the DL-Hazmat dataset, which consists of over 140k high-resolution grayscale images of 64 hazardous material symbols. Experimental results show a significant improvement in model accuracy and a substantial reduction in memory requirements, demonstrating the effectiveness of our method in enabling deep learning applications on embedded devices in real-world scenarios. Our work expands the application of Continual Learning and BNNs for efficient on-device training, offering a promising solution for IoT and other resource-constrained environments.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105524"},"PeriodicalIF":4.2,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143760754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"I3Net: Intensive information interaction network for RGB-T salient object detection","authors":"Jia Hou , Hongfa Wen , Shuai Wang , Chenggang Yan","doi":"10.1016/j.imavis.2025.105525","DOIUrl":"10.1016/j.imavis.2025.105525","url":null,"abstract":"<div><div>Multi-modality salient object detection (SOD) is receiving more and more attention in recent years. Infrared thermal images can provide useful information in extreme situations, such as low illumination and cluttered background. Accompany with extra information, we need a more delicate design to properly integrate multi-modal and multi-scale clues. In this paper, we propose an intensively information interaction network (I<sup>3</sup>Net) to perform Red-Green-Blue and Thermal (RGB-T) SOD, which optimizes the performance through modality interaction, level interaction, and scale interaction. Firstly, feature channels from different sources are dynamically selected according to the modality interaction with dynamic merging module. Then, adjacent level interaction is conducted under the guidance of coordinate channel and spatial attention with spatial feature aggregation module. Finally, we deploy pyramid attention module to obtain a more comprehensive scale interaction. Extensive experiments on four RGB-T datasets, VT821, VT1000, VT5000 and VI-RGBT3500, show that the proposed I<sup>3</sup>Net achieves a competitive and excellent performance against 13 state-of-the-art methods in multiple evaluation metrics, with a 1.70%, 1.41%, and 1.54% improvement in terms of weighted F-measure, mean E-measure, and S-measure.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105525"},"PeriodicalIF":4.2,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143760755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Counterfactual learning and saliency augmentation for weakly supervised semantic segmentation","authors":"Xiangfu Ding , Youjia Shao , Na Tian , Li Wang , Wencang Zhao","doi":"10.1016/j.imavis.2025.105523","DOIUrl":"10.1016/j.imavis.2025.105523","url":null,"abstract":"<div><div>The weakly supervised semantic segmentation based on image-level annotation has garnered widespread attention due to its excellent annotation efficiency and remarkable scalability. Numerous studies have utilized class activation maps generated by classification networks to produce pseudo-labels and train segmentation models accordingly. However, these methods exhibit certain limitations: biased localization activations, co-occurrence from the background, and semantic absence of target objects. We re-examine the aforementioned issues from a causal perspective and propose a framework for CounterFactual Learning and Saliency Augmentation (CFLSA) based on causal inference. CFLSA consists of a debiased causal chain and a positional causal chain. The debiased causal chain, through counterfactual decoupling generation module, compels the model to focus on constant target features while disregarding background features. It effectively eliminates spurious correlations between foreground objects and the background. Additionally, issues of biased activation and co-occurring pixel are alleviated. Secondly, in order to enable the model to recognize more comprehensive semantic information, we introduce a saliency augmentation mechanism in the positional causal chain to dynamically perceive foreground objects and background information. It can facilitate pixel-level feedback, leading to improved segmentation performance. With the collaboration of both chains, CFLSA achieves advanced results on the PASCAL VOC 2012 and MS COCO 2014 datasets.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105523"},"PeriodicalIF":4.2,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143747751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MMGS: Multi-Model Synergistic Gaussian Splatting for Sparse View Synthesis","authors":"Changyue Shi , Chuxiao Yang , Xinyuan Hu , Yan Yang , Jiajun Ding , Min Tan","doi":"10.1016/j.imavis.2025.105512","DOIUrl":"10.1016/j.imavis.2025.105512","url":null,"abstract":"<div><div>3D Gaussian Splatting (3DGS) generates a field composed of 3D Gaussians to represent a scene. As the number of input training views decreases, the range of possible solutions that fit only training views expands significantly, making it challenging to identify the optimal result for 3DGS. To this end, a synergistic method is proposed during training and rendering under sparse inputs. The proposed method consists of two main components: Synergistic Transition and Synergistic Rendering. During training, we utilize multiple Gaussian fields to synergize their contributions and determine whether each Gaussian primitive has fallen into an ambiguous region. These regions impede the process for Gaussian primitives to discover alternative positions. This work extends Stochastic Gradient Langevin Dynamic updating and proposes a reformulated version of it. With this reformulation, the Gaussian primitives stuck in ambiguous regions adjust their positions, enabling them to explore an alternative solution. Furthermore, a Synergistic Rendering strategy is implemented during the rendering process. With Gaussian fields trained in the first stage, this approach synergizes the parallel branches to improve the quality of the rendered outputs. With Synergistic Transition and Synergistic Rendering, our method achieves photo-realistic novel view synthesis results under sparse inputs. Extensive experiments demonstrate that our method outperforms previous methods across diverse datasets, including LLFF, Mip-NeRF360, and Blender.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105512"},"PeriodicalIF":4.2,"publicationDate":"2025-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143799169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autonomous navigation and visual navigation in robot mission execution","authors":"Shulei Wang , Yan Wang , Zeyu Sun","doi":"10.1016/j.imavis.2025.105516","DOIUrl":"10.1016/j.imavis.2025.105516","url":null,"abstract":"<div><div>Navigating autonomously in complex environments remains a significant challenge, as traditional methods relying on precise metric maps and conventional path planning algorithms often struggle with dynamic obstacles and demand high computational resources. To address these limitations, we propose a topological path planning approach that employs Bernstein polynomial parameterization and real-time object guidance to iteratively refine the preliminary path, ensuring smoothness and dynamic feasibility. Simulation results demonstrate that our method outperforms MSMRL, ANS, and NTS in both weighted inverse path length and navigation success rate. In real-world scenarios, it consistently achieves higher success rates and path efficiency compared to the widely used OGMADWA method. These findings confirm that our approach enables efficient and reliable navigation in dynamic environments while maintaining strong adaptability and robustness in path planning.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105516"},"PeriodicalIF":4.2,"publicationDate":"2025-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143768517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-stream transformer tracking with messengers","authors":"Miaobo Qiu , Wenyang Luo , Tongfei Liu , Yanqin Jiang , Jiaming Yan , Wenjuan Li , Jin Gao , Weiming Hu , Stephen Maybank","doi":"10.1016/j.imavis.2025.105510","DOIUrl":"10.1016/j.imavis.2025.105510","url":null,"abstract":"<div><div>Recently, one-stream trackers gradually surpass two-stream trackers and become popular due to their higher accuracy. However, they suffer from a substantial amount of computational redundancy and an increased inference latency. This paper combines the speed advantage of two-stream trackers with the accuracy advantage of one-stream trackers, and proposes a new two-stream Transformer tracker called MesTrack. The core designs of MesTrack lie in the messenger tokens and the message integration module. The messenger tokens obtain the target-specific information during the feature extraction stage of the template branch, while the message integration module integrates the target-specific information from the template branch into the search branch. To further improve accuracy, this paper proposes an adaptive label smoothing knowledge distillation training scheme. This scheme uses the weighted sum of the teacher model’s prediction and the ground truth as supervisory information to guide the training of the student model. The weighting coefficients, which are predicted by the student model, are used to maintain the useful complementary information from the teacher model while simultaneously correcting its erroneous predictions. Evaluation on multiple popular tracking datasets show that MesTrack achieves competitive results. On the LaSOT dataset, the MesTrack-B-384 version achieves a SUC (success rate) score of 73.8%, reaching the SOTA (state of the art) performance, at an inference speed of 69.2 FPS (frames per second). When deployed with TensorRT, the speed can be further improved to 122.6 FPS.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105510"},"PeriodicalIF":4.2,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143799346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards trustworthy image super-resolution via symmetrical and recursive artificial neural network","authors":"Mingliang Gao , Jianhao Sun , Qilei Li , Muhammad Attique Khan , Jianrun Shang , Xianxun Zhu , Gwanggil Jeon","doi":"10.1016/j.imavis.2025.105519","DOIUrl":"10.1016/j.imavis.2025.105519","url":null,"abstract":"<div><div>AI-assisted living environments by widely apply the image super-resolution technique to improve the clarity of visual inputs for devices like smart cameras and medical monitors. This increased resolution enables more accurate object recognition, facial identification, and health monitoring, contributing to a safer and more efficient assisted living experience. Although rapid progress has been achieved, most current methods suffer from huge computational costs due to the complex network structures. To address this problem, we propose a symmetrical and recursive transformer network (SRTNet) for efficient image super-resolution via integrating the symmetrical CNN (S-CNN) unit and improved recursive Transformer (IRT) unit. Specifically, the S-CNN unit is equipped with a designed local feature enhancement (LFE) module and a feature distillation attention in attention (FDAA) block to realize efficient feature extraction and utilization. The IRT unit is introduced to capture long-range dependencies and contextual information to guarantee that the reconstruction image preserves high-frequency texture details. Extensive experiments demonstrate that the proposed SRTNet achieves competitive performance regarding reconstruction quality and model complexity compared with the state-of-the-art methods. In the <span><math><mrow><mo>×</mo><mn>2</mn></mrow></math></span>, <span><math><mrow><mo>×</mo><mn>3</mn></mrow></math></span>, and <span><math><mrow><mo>×</mo><mn>4</mn></mrow></math></span> super-resolution tasks, SRTNet achieves the best performance on the BSD100, Set14, Set5, Manga109, and Urban100 datasets while maintaining low computational complexity.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105519"},"PeriodicalIF":4.2,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143776759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic semantic prototype perception for text–video retrieval","authors":"Henghao Zhao, Rui Yan, Zechao Li","doi":"10.1016/j.imavis.2025.105515","DOIUrl":"10.1016/j.imavis.2025.105515","url":null,"abstract":"<div><div>Semantic alignment between local visual regions and textual description is a promising solution for fine-grained text–video retrieval task. However, existing methods rely on the additional object detector as the explicit supervision, which is unfriendly to real application. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) is proposed that automatically learns, constructs and infers the dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: the spatial semantic parsing module, the spatio-temporal semantic correlation module and the cross-modal semantic prototype alignment. The spatial semantic parsing module is leveraged to quantize visual patches to reduce the visual diversity, which helps to subsequently aggregate the similar semantic regions. The spatio-temporal semantic correlation module is introduced to learn dynamic information between adjacent frames and aggregate local features belonging to the same semantic in the video as tube. In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, which provides spatio-temporal cues for cross-modal perception of dynamic semantic prototypes. Thus, the proposed DSP Perception enables to capture local regions and their dynamic information within the video. Extensive experiments conducted on four widely-used datasets (MSR-VTT, MSVD, ActivityNet-Caption and DiDeMo) demonstrate the effectiveness of the proposed DSP Perception by comparison with several state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105515"},"PeriodicalIF":4.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-MambaNav: Enhancing object-goal navigation through integration of spatial–temporal scanning with state space models","authors":"Leyuan Sun , Yusuke Yoshiyasu","doi":"10.1016/j.imavis.2025.105522","DOIUrl":"10.1016/j.imavis.2025.105522","url":null,"abstract":"<div><div>Object-goal Navigation (ObjectNav) involves locating a specified target object using a textual command combined with semantic understanding in an unknown environment. This requires the embodied agent to have advanced spatial and temporal comprehension about environment during navigation. While earlier approaches focus on spatial modeling, they either do not utilize episodic temporal memory (e.g., keeping track of explored and unexplored spaces) or are computationally prohibitive, as long-horizon memory knowledge is resource-intensive in both storage and training. To address this issue, this paper introduces the Memory-MambaNav model, which employs multiple Mamba-based layers for refined spatial–temporal modeling. Leveraging the Mamba architecture, known for its global receptive field and linear complexity, Memory-MambaNav can efficiently extract and process memory knowledge from accumulated historical observations. To enhance spatial modeling, we introduce the Memory Spatial Difference State Space Model (MSD-SSM) to address the limitations of previous CNN and Transformer-based models in terms of receptive field and computational demand. For temporal modeling, the proposed Memory Temporal Serialization SSM (MTS-SSM) leverages Mamba’s selective scanning capabilities in a cross-temporal manner, enhancing the model’s temporal understanding and interaction with bi-temporal features. We also integrate memory-aggregated egocentric obstacle-awareness embeddings (MEOE) and memory-based fine-grained rewards into our end-to-end policy training, which improve obstacle understanding and accelerate convergence by fully utilizing memory knowledge. Our experiments on the AI2-Thor dataset confirm the benefits and superior performance of proposed Memory-MambaNav, demonstrating Mamba’s potential in ObjectNav, particularly in long-horizon trajectories. All demonstration videos referenced in this paper can be viewed on the webpage (<span><span>https://sunleyuan.github.io/Memory-MambaNav</span><svg><path></path></svg></span>).</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105522"},"PeriodicalIF":4.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFDW: Distribution-aware Filter and Dynamic Weight for open-mixed-domain Test-time adaptation","authors":"Mingwen Shao , Xun Shao , Lingzhuang Meng , Yuanyuan Liu","doi":"10.1016/j.imavis.2025.105521","DOIUrl":"10.1016/j.imavis.2025.105521","url":null,"abstract":"<div><div>Test-time adaptation (TTA) aims to adapt the pre-trained model to the unlabeled test data stream during inference. However, existing state-of-the-art TTA methods typically achieve superior performance in closed-set scenarios, and often underperform in more challenging open mixed-domain TTA scenarios. This can be attributed to ignoring two uncertainties: domain non-stationarity and semantic shifts, leading to inaccurate estimation of data distribution and unreliable model confidence. To alleviate the aforementioned issue, we propose a universal TTA method based on a Distribution-aware Filter and Dynamic Weight, called DFDW. Specifically, in order to improve the model’s discriminative ability to data distribution, our DFDW first designs a distribution-aware threshold to filter known and unknown samples from the test data, and then separates them based on contrastive learning. Furthermore, to improve the confidence and generalization of the model, we designed a dynamic weight consisting of category-reliable weight and diversity weight. Among them, category-reliable weight uses prior average predictions to enhance the guidance of high-confidence samples, and diversity weight uses negative information entropy to increase the influence of diversity samples. Based on the above approach, the model can accurately identify the distribution of semantic shift samples, and widely adapt to the diversity samples in the non-stationary domain. Extensive experiments on CIFAR and ImageNet-C benchmarks show the superiority of our DFDW.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105521"},"PeriodicalIF":4.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143776758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}