{"title":"Autonomous navigation and visual navigation in robot mission execution","authors":"Shulei Wang , Yan Wang , Zeyu Sun","doi":"10.1016/j.imavis.2025.105516","DOIUrl":"10.1016/j.imavis.2025.105516","url":null,"abstract":"<div><div>Navigating autonomously in complex environments remains a significant challenge, as traditional methods relying on precise metric maps and conventional path planning algorithms often struggle with dynamic obstacles and demand high computational resources. To address these limitations, we propose a topological path planning approach that employs Bernstein polynomial parameterization and real-time object guidance to iteratively refine the preliminary path, ensuring smoothness and dynamic feasibility. Simulation results demonstrate that our method outperforms MSMRL, ANS, and NTS in both weighted inverse path length and navigation success rate. In real-world scenarios, it consistently achieves higher success rates and path efficiency compared to the widely used OGMADWA method. These findings confirm that our approach enables efficient and reliable navigation in dynamic environments while maintaining strong adaptability and robustness in path planning.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105516"},"PeriodicalIF":4.2,"publicationDate":"2025-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143768517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-stream transformer tracking with messengers","authors":"Miaobo Qiu , Wenyang Luo , Tongfei Liu , Yanqin Jiang , Jiaming Yan , Wenjuan Li , Jin Gao , Weiming Hu , Stephen Maybank","doi":"10.1016/j.imavis.2025.105510","DOIUrl":"10.1016/j.imavis.2025.105510","url":null,"abstract":"<div><div>Recently, one-stream trackers gradually surpass two-stream trackers and become popular due to their higher accuracy. However, they suffer from a substantial amount of computational redundancy and an increased inference latency. This paper combines the speed advantage of two-stream trackers with the accuracy advantage of one-stream trackers, and proposes a new two-stream Transformer tracker called MesTrack. The core designs of MesTrack lie in the messenger tokens and the message integration module. The messenger tokens obtain the target-specific information during the feature extraction stage of the template branch, while the message integration module integrates the target-specific information from the template branch into the search branch. To further improve accuracy, this paper proposes an adaptive label smoothing knowledge distillation training scheme. This scheme uses the weighted sum of the teacher model’s prediction and the ground truth as supervisory information to guide the training of the student model. The weighting coefficients, which are predicted by the student model, are used to maintain the useful complementary information from the teacher model while simultaneously correcting its erroneous predictions. Evaluation on multiple popular tracking datasets show that MesTrack achieves competitive results. On the LaSOT dataset, the MesTrack-B-384 version achieves a SUC (success rate) score of 73.8%, reaching the SOTA (state of the art) performance, at an inference speed of 69.2 FPS (frames per second). When deployed with TensorRT, the speed can be further improved to 122.6 FPS.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105510"},"PeriodicalIF":4.2,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143799346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards trustworthy image super-resolution via symmetrical and recursive artificial neural network","authors":"Mingliang Gao , Jianhao Sun , Qilei Li , Muhammad Attique Khan , Jianrun Shang , Xianxun Zhu , Gwanggil Jeon","doi":"10.1016/j.imavis.2025.105519","DOIUrl":"10.1016/j.imavis.2025.105519","url":null,"abstract":"<div><div>AI-assisted living environments by widely apply the image super-resolution technique to improve the clarity of visual inputs for devices like smart cameras and medical monitors. This increased resolution enables more accurate object recognition, facial identification, and health monitoring, contributing to a safer and more efficient assisted living experience. Although rapid progress has been achieved, most current methods suffer from huge computational costs due to the complex network structures. To address this problem, we propose a symmetrical and recursive transformer network (SRTNet) for efficient image super-resolution via integrating the symmetrical CNN (S-CNN) unit and improved recursive Transformer (IRT) unit. Specifically, the S-CNN unit is equipped with a designed local feature enhancement (LFE) module and a feature distillation attention in attention (FDAA) block to realize efficient feature extraction and utilization. The IRT unit is introduced to capture long-range dependencies and contextual information to guarantee that the reconstruction image preserves high-frequency texture details. Extensive experiments demonstrate that the proposed SRTNet achieves competitive performance regarding reconstruction quality and model complexity compared with the state-of-the-art methods. In the <span><math><mrow><mo>×</mo><mn>2</mn></mrow></math></span>, <span><math><mrow><mo>×</mo><mn>3</mn></mrow></math></span>, and <span><math><mrow><mo>×</mo><mn>4</mn></mrow></math></span> super-resolution tasks, SRTNet achieves the best performance on the BSD100, Set14, Set5, Manga109, and Urban100 datasets while maintaining low computational complexity.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105519"},"PeriodicalIF":4.2,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143776759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic semantic prototype perception for text–video retrieval","authors":"Henghao Zhao, Rui Yan, Zechao Li","doi":"10.1016/j.imavis.2025.105515","DOIUrl":"10.1016/j.imavis.2025.105515","url":null,"abstract":"<div><div>Semantic alignment between local visual regions and textual description is a promising solution for fine-grained text–video retrieval task. However, existing methods rely on the additional object detector as the explicit supervision, which is unfriendly to real application. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) is proposed that automatically learns, constructs and infers the dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: the spatial semantic parsing module, the spatio-temporal semantic correlation module and the cross-modal semantic prototype alignment. The spatial semantic parsing module is leveraged to quantize visual patches to reduce the visual diversity, which helps to subsequently aggregate the similar semantic regions. The spatio-temporal semantic correlation module is introduced to learn dynamic information between adjacent frames and aggregate local features belonging to the same semantic in the video as tube. In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, which provides spatio-temporal cues for cross-modal perception of dynamic semantic prototypes. Thus, the proposed DSP Perception enables to capture local regions and their dynamic information within the video. Extensive experiments conducted on four widely-used datasets (MSR-VTT, MSVD, ActivityNet-Caption and DiDeMo) demonstrate the effectiveness of the proposed DSP Perception by comparison with several state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105515"},"PeriodicalIF":4.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-MambaNav: Enhancing object-goal navigation through integration of spatial–temporal scanning with state space models","authors":"Leyuan Sun , Yusuke Yoshiyasu","doi":"10.1016/j.imavis.2025.105522","DOIUrl":"10.1016/j.imavis.2025.105522","url":null,"abstract":"<div><div>Object-goal Navigation (ObjectNav) involves locating a specified target object using a textual command combined with semantic understanding in an unknown environment. This requires the embodied agent to have advanced spatial and temporal comprehension about environment during navigation. While earlier approaches focus on spatial modeling, they either do not utilize episodic temporal memory (e.g., keeping track of explored and unexplored spaces) or are computationally prohibitive, as long-horizon memory knowledge is resource-intensive in both storage and training. To address this issue, this paper introduces the Memory-MambaNav model, which employs multiple Mamba-based layers for refined spatial–temporal modeling. Leveraging the Mamba architecture, known for its global receptive field and linear complexity, Memory-MambaNav can efficiently extract and process memory knowledge from accumulated historical observations. To enhance spatial modeling, we introduce the Memory Spatial Difference State Space Model (MSD-SSM) to address the limitations of previous CNN and Transformer-based models in terms of receptive field and computational demand. For temporal modeling, the proposed Memory Temporal Serialization SSM (MTS-SSM) leverages Mamba’s selective scanning capabilities in a cross-temporal manner, enhancing the model’s temporal understanding and interaction with bi-temporal features. We also integrate memory-aggregated egocentric obstacle-awareness embeddings (MEOE) and memory-based fine-grained rewards into our end-to-end policy training, which improve obstacle understanding and accelerate convergence by fully utilizing memory knowledge. Our experiments on the AI2-Thor dataset confirm the benefits and superior performance of proposed Memory-MambaNav, demonstrating Mamba’s potential in ObjectNav, particularly in long-horizon trajectories. All demonstration videos referenced in this paper can be viewed on the webpage (<span><span>https://sunleyuan.github.io/Memory-MambaNav</span><svg><path></path></svg></span>).</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105522"},"PeriodicalIF":4.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFDW: Distribution-aware Filter and Dynamic Weight for open-mixed-domain Test-time adaptation","authors":"Mingwen Shao , Xun Shao , Lingzhuang Meng , Yuanyuan Liu","doi":"10.1016/j.imavis.2025.105521","DOIUrl":"10.1016/j.imavis.2025.105521","url":null,"abstract":"<div><div>Test-time adaptation (TTA) aims to adapt the pre-trained model to the unlabeled test data stream during inference. However, existing state-of-the-art TTA methods typically achieve superior performance in closed-set scenarios, and often underperform in more challenging open mixed-domain TTA scenarios. This can be attributed to ignoring two uncertainties: domain non-stationarity and semantic shifts, leading to inaccurate estimation of data distribution and unreliable model confidence. To alleviate the aforementioned issue, we propose a universal TTA method based on a Distribution-aware Filter and Dynamic Weight, called DFDW. Specifically, in order to improve the model’s discriminative ability to data distribution, our DFDW first designs a distribution-aware threshold to filter known and unknown samples from the test data, and then separates them based on contrastive learning. Furthermore, to improve the confidence and generalization of the model, we designed a dynamic weight consisting of category-reliable weight and diversity weight. Among them, category-reliable weight uses prior average predictions to enhance the guidance of high-confidence samples, and diversity weight uses negative information entropy to increase the influence of diversity samples. Based on the above approach, the model can accurately identify the distribution of semantic shift samples, and widely adapt to the diversity samples in the non-stationary domain. Extensive experiments on CIFAR and ImageNet-C benchmarks show the superiority of our DFDW.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105521"},"PeriodicalIF":4.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143776758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image–text feature learning for unsupervised visible–infrared person re-identification","authors":"Jifeng Guo , Zhiqi Pang","doi":"10.1016/j.imavis.2025.105520","DOIUrl":"10.1016/j.imavis.2025.105520","url":null,"abstract":"<div><div>Visible–infrared person re-identification (VI-ReID) focuses on matching infrared and visible images of the same person. To reduce labeling costs, unsupervised VI-ReID (UVI-ReID) methods typically use clustering algorithms to generate pseudo-labels and iteratively optimize the model based on these pseudo-labels. Although existing UVI-ReID methods have achieved promising performance, they often overlook the effectiveness of text semantics in inter-modality matching and modality-invariant feature learning. In this paper, we propose an image–text feature learning (ITFL) method, which not only leverages text semantics to enhance intra-modality identity-related learning but also incorporates text semantics into inter-modality matching and modality-invariant feature learning. Specifically, ITFL first performs modality-aware feature learning to generate pseudo-labels within each modality. Then, ITFL employs modality-invariant text modeling (MTM) to learn a text feature for each cluster in the visible modality, and utilizes inter-modality dual-semantics matching (IDM) to match inter-modality positive clusters. To obtain modality-invariant and identity-related image features, we not only introduce a cross-modality contrastive loss in ITFL to mitigate the impact of modality gaps, but also develop a text semantic consistency loss to further promote modality-invariant feature learning. Extensive experimental results on VI-ReID datasets demonstrate that ITFL not only outperforms existing unsupervised methods but also competes with some supervised approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105520"},"PeriodicalIF":4.2,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143724889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A systematic review of intermediate fusion in multimodal deep learning for biomedical applications","authors":"Valerio Guarrasi , Fatih Aksu , Camillo Maria Caruso , Francesco Di Feola , Aurora Rofena , Filippo Ruffini , Paolo Soda","doi":"10.1016/j.imavis.2025.105509","DOIUrl":"10.1016/j.imavis.2025.105509","url":null,"abstract":"<div><div>Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105509"},"PeriodicalIF":4.2,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fusing grid and adaptive region features for image captioning","authors":"Jiahui Wei , Zhixin Li , Canlong Zhang , Huifang Ma","doi":"10.1016/j.imavis.2025.105513","DOIUrl":"10.1016/j.imavis.2025.105513","url":null,"abstract":"<div><div>Image captioning aims to automatically generate grammatically correct and reasonable description sentences for given images. Improving feature optimization and processing is crucial for enhancing performance in this task. A common approach is to leverage the complementary advantages of grid features and region features. However, incorporating region features in most current methods may lead to incorrect guidance during training, along with high acquisition costs and the requirement of pre-caching. These factors impact the effectiveness and practical application of image captioning. To address these limitations, this paper proposes a method called fusing grid and adaptive region features for image captioning (FGAR). FGAR dynamically explores pseudo-region information within a given image based on the extracted grid features. Subsequently, it utilizes a combination of computational layers with varying permissions to fuse features, enabling comprehensive interaction between information from different modalities while preserving the unique characteristics of each modality. The resulting enhanced visual features provide improved support to the decoder for autoregressively generating sentences describing the content of a given image. All processes are integrated within a fully end-to-end framework, facilitating both training and inference processes while achieving satisfactory performance. Extensive experiments validate the effectiveness of the proposed FGAR method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105513"},"PeriodicalIF":4.2,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143705248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stealth sight: A multi perspective approach for camouflaged object detection","authors":"Domnic S., Jayanthan K.S.","doi":"10.1016/j.imavis.2025.105517","DOIUrl":"10.1016/j.imavis.2025.105517","url":null,"abstract":"<div><div>Camouflaged object detection (COD) is a challenging task due to the inherent similarity between objects and their surroundings. This paper introduces <strong>Stealth Sight</strong>, a novel framework integrating multi-view feature fusion and depth-based refinement to enhance segmentation accuracy. Our approach incorporates a pretrained multi-view CLIP encoder and a depth extraction network, facilitating robust feature representation. Additionally, we introduce a cross-attention transformer decoder and a post-training pruning mechanism to improve efficiency. Extensive evaluations on benchmark datasets demonstrate that Stealth Sight outperforms state-of-the-art methods in camouflaged object segmentation. Our method significantly enhances detection in complex environments, making it applicable to medical imaging, security, and wildlife monitoring.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105517"},"PeriodicalIF":4.2,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143705247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}