{"title":"ERTFNet: Enhanced RGB-T Fusion Network for semantic segmentation by integrating thermal edge features","authors":"Hanqi Yin , Liguo Zhang , Yiming Sun , Guisheng Yin","doi":"10.1016/j.cviu.2025.104421","DOIUrl":"10.1016/j.cviu.2025.104421","url":null,"abstract":"<div><div>Semantic segmentation is crucial for computer vision, especially in the field of autonomous driving. RGB-Thermal (RGB-T) fusion networks enhance semantic segmentation accuracy in road scenes. However, most existing methods employ the same module structure to extract features from both RGB and thermal images, and all the obtained features are subsequently fused, neglecting the unique characteristics of each modality. Nevertheless, the fused thermal features may introduce noise and redundancy into the network, which is capable of segmenting objects well solely using RGB images. As a result, the performance and accuracy of the approach are limited in complex scenarios. To address this problem, a novel method named Enhanced RGB-T Fusion Network (ERTFNet) is proposed by adopting the encoder–decoder design concept. The constructed encoder in ERTFNet can obtain fused features by combining the extracted edge features from thermal images with RGB image features processed by an attention mechanism. Then, the feature map is restored by a general decoder. Additionally, we introduce the spatial edge constraints during the training stage to further enhance the model’s ability to capture image details and improve both prediction accuracy and boundary clarity. Experiments on two public datasets, compared with existing methods, show that the proposed method can obtain more clear visual contours and higher prediction accuracy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104421"},"PeriodicalIF":4.3,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144513578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

CRDT-based knowledge synchronisation in an Internet of Robotics Things ecosystem for Ambient Assisted Living
José Galeas, Alberto Tudela, Óscar Pons, Juan Pedro Bandera, Antonio Bandera, Pablo Bustos
Computer Vision and Image Understanding, vol. 259, Article 104437. DOI: 10.1016/j.cviu.2025.104437. Published 26 June 2025.

Integrating IoT and assistive robots in the design of Ambient Assisted Living (AAL) frameworks has proven to be a useful solution for monitoring and assisting elderly people at home. To manage the captured information and assess the person's condition, respond to emergencies, promote physical or cognitive exercises, and so on, these systems can also integrate a Virtual Caregiver (VC). Given the diversity of technologies deployed in such an AAL framework, deciding how to manage knowledge appropriately can be complex. This paper proposes to organise the AAL framework as a distributed system, i.e., as a collection of autonomous software agents that provide users with a single coherent response. In this distributed system, agents are deployed locally and handle replicas of the knowledge model, so the problem of merging these replicas into a consistent representation arises. The δ-CRDT (Conflict-free Replicated Data Type) synchronisation mechanism is employed to ensure eventual consistency with low communication overhead. To manage the dynamics of the AAL ecosystem, the δ-CRDT is combined with a publish/subscribe interaction protocol. In this way, the functionalities that depend on the IoT devices, the robot and the VC adapt efficiently to changes in context. To demonstrate the validity of the proposal, two use cases requiring a collaborative response from the system were designed: the first deals with a possible fall of the user at home, while the second deals with helping the person move small objects around the flat. Measured latency and data-consistency values show that the proposal works satisfactorily.
{"title":"UM-Mamba: An efficient U-network with medical visual state space for medical image segmentation","authors":"Hejian Chen , Qing Liu , Zhongming Fu, Li Liu","doi":"10.1016/j.cviu.2025.104436","DOIUrl":"10.1016/j.cviu.2025.104436","url":null,"abstract":"<div><div>Designing computationally efficient network architectures remains a persistent necessity in medical image segmentation. Lately, State Space Models (SSMs) are emerging in the field of deep learning and gradually becoming effective basic building layers (or blocks) for constructing deep networks. SSMs not only effectively capture long-distance dependencies but also maintain linear computational complexity relative to input sizes. However, the non-sequential structure of 2D images limits its application in visual tasks. To solve this problem, this paper designs a Medical Visual State Space (MVSS) block with 2D Spiral Selective Scanning (SSS2D) module as the core, and constructs a U-shaped medical image segmentation network called UM-Mamba. The SSS2D module traverses the samples through four spiral scanning paths, which makes up for the deficiency of Mamba architecture in the non-sequential structure of 2D images. We conduct experiments on the Kvasir-SEG and ISIC2018 datasets, and achieve the best results in Dice, IoU and MAE by fine-tuning, which proves that UM-Mamba has the leading level in the experimental datasets.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104436"},"PeriodicalIF":4.3,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144513569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

FAR-AMTN: Attention Multi-Task Network for Face Attribute Recognition
Gong Gao, Zekai Wang, Xianhui Liu, Weidong Zhao
Computer Vision and Image Understanding, vol. 259, Article 104426. DOI: 10.1016/j.cviu.2025.104426. Published 24 June 2025.

To enhance the generalization performance of Multi-Task Networks (MTN) in Face Attribute Recognition (FAR), it is crucial to share relevant information effectively across multiple related prediction tasks. Traditional MTN methods create shared low-level modules and distinct high-level modules, causing an exponential increase in model parameters as tasks are added. This approach also limits feature interaction at the high level, hindering the exploration of semantic relations among attributes and thereby hurting generalization. In response, this study introduces FAR-AMTN, a novel Attention Multi-Task Network for FAR. It incorporates a Weight-Shared Group-Specific Attention (WSGSA) module, whose parameters are shared, to minimize complexity while improving group feature representation. A Cross-Group Feature Fusion (CGFF) module fosters interactions between attribute groups and enhances feature learning, and a Dynamic Weighting Strategy (DWS) is introduced for synchronized task convergence. Experiments on the CelebA and LFWA datasets show that FAR-AMTN achieves superior accuracy with significantly fewer parameters than existing models.
{"title":"Improving a segment anything model for segmenting low-quality medical images via an adapter","authors":"Can Bai , Jie Wang , Xianjun Han , Zijian Wu","doi":"10.1016/j.cviu.2025.104425","DOIUrl":"10.1016/j.cviu.2025.104425","url":null,"abstract":"<div><div>With the increase in large models, segmentation foundation models have greatly improved the segmentation results of medical images. However, these foundation models often yield unsatisfactory segmentation results because their training data rarely involve low-quality images. In medical imaging, issues such as considerable noise or poor image resolution are common due to imaging equipment. Using these segmentation foundation models on such images produces poor results. To address this challenge, we utilize a low-quality perception adapter to improve the capabilities of segmentation foundation models, specifically in terms of handling low-quality medical images. First, the low-quality perception adapter distills the intrinsic statistical features from images compromised by noise or reduced clarity. These intrinsic features are aligned with textual-level attributes by employing contrastive learning. Then, we use a text-vision progressive fusion strategy, starting with multilevel text–image fusion to incorporate multimodal information. Next, we incorporate visual features from the underlying segmentation foundation model. Finally, a carefully designed decoder predicts the segmented mask. The low-quality perception adapter reduces the impacts of blur and noise on the developed model, while text-based contrastive learning, along with multimodal fusion, bridge the semantic gap. Experiments demonstrate that the proposed model significantly improves segmentation accuracy on noisy or blurry medical images, with gains up to 24.6% in mIoU and 13.6% in pixel accuracy over state-of-the-art methods across multiple datasets.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104425"},"PeriodicalIF":4.3,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144513568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Adaptive context mining for camouflaged object detection with scribble supervision
Dongdong Zhang, Chunping Wang, Huiying Wang, Qiang Fu, Zhaorui Li
Computer Vision and Image Understanding, vol. 259, Article 104430. DOI: 10.1016/j.cviu.2025.104430. Published 24 June 2025.

Camouflaged object detection (COD) aims to find objects hidden in their surroundings and has attracted extensive attention in recent years. Although fully supervised COD methods have made considerable progress, they rely heavily on expensive pixel-level annotations. Scribble-based weakly supervised methods can alleviate this problem, but with only limited information in the training data they struggle to fully understand complex COD tasks and to achieve outstanding performance. In this paper, inspired by the human visual mechanism, we propose SCNet, a novel framework for scribble-based COD. The framework focuses on learning multi-scale context-aware features and employs a two-stage strategy for efficient detection. Specifically, we first adopt the improved Pyramid Vision Transformer (PVTv2) as the backbone to extract multi-scale global contextual information. A neighbor-interactive decoder (NID) then coarsely localizes potential object regions, and a refinement module (RM) facilitates multi-scale information interaction and contextual information mining to refine those regions. In addition, an adaptive local camouflage coherence (ALCC) loss is devised to enhance the network's adaptability to different complex scenarios. Experimental results on three benchmark COD datasets show that SCNet, using only scribble annotations and no pre- or post-processing, not only outperforms six state-of-the-art weakly supervised methods but even surpasses some fully supervised COD methods. Moreover, SCNet achieves promising results on a COD-related task (polyp segmentation). The results of our method are available at https://github.com/zcc0616/SCNet.

An effective CNN and Transformer fusion network for camouflaged object detection
Dongdong Zhang, Chunping Wang, Huiying Wang, Qiang Fu, Zhaorui Li
Computer Vision and Image Understanding, vol. 259, Article 104431. DOI: 10.1016/j.cviu.2025.104431. Published 21 June 2025.

Camouflaged object detection (COD) aims to identify concealed objects in images, a task for which both global context and local spatial detail are crucial. Convolutional neural networks (CNNs) excel at capturing fine-grained local features, while Transformers are adept at modeling global contextual information. To leverage their respective strengths, we propose a novel CNN–Transformer fusion network (CTF-Net) for COD that achieves more accurate detection. Our approach employs parallel CNN and Transformer branches as an encoder to extract complementary features, which a cross-domain fusion module (CDFM) then fuses via cross-modulation. Additionally, a boundary-aware module (BAM) combines low-level edge details with high-level global context to extract camouflaged-object edge features, and a feature enhancement module (FEM) mitigates background and noise interference during cross-layer feature fusion, highlighting camouflaged object regions for precise predictions. Extensive experiments show that CTF-Net outperforms 16 existing state-of-the-art methods on four widely used COD datasets. In particular, CTF-Net improves F-measure by about 5.1% on the NC4K dataset, showing that it can accurately detect camouflaged objects. Our code is publicly available at https://github.com/zcc0616/CTF-Net.

Multilevel spatial–temporal feature analysis for generic event boundary detection in videos
Van Thong Huynh, Seungwon Kim, Hyung-Jeong Yang, Soo-Hyung Kim
Computer Vision and Image Understanding, vol. 259, Article 104429. DOI: 10.1016/j.cviu.2025.104429. Published 20 June 2025.

Generic event boundary detection (GEBD) aims to split video into chunks at a broad and diverse set of actions, as humans naturally perceive event boundaries. In this study, we propose an approach that leverages multilevel spatial–temporal features to construct a framework for localizing generic events in videos. Our method capitalizes on the correlation between neighbor frames, employing a hierarchy of spatial and temporal features to create a comprehensive representation. Specifically, features from multiple spatial dimensions of a pre-trained ResNet-50 are combined with diverse temporal views, generating a multilevel spatial–temporal feature map. This map facilitates the calculation of similarities between neighbor frames, which are then projected to build a multilevel spatial–temporal similarity feature vector. Subsequently, a decoder employing 1D convolution operations deciphers these similarities, incorporating their temporal relationships to estimate boundary scores effectively. Extensive experiments conducted on the GEBD benchmark dataset demonstrate the superior performance of our system and its variants, outperforming state-of-the-art approaches. Furthermore, additional experiments on the TAPOS dataset, comprising long-form videos with Olympic sport actions, reaffirm the efficacy of the proposed methodology compared to existing techniques.

Light-YOLO: A lightweight and high-performance network for detecting small obstacles on roads at night
Dan Huang, Guangyin Zhang, Zixu Li, Keying Liu, Wenguang Luo
Computer Vision and Image Understanding, vol. 259, Article 104428. DOI: 10.1016/j.cviu.2025.104428. Published 20 June 2025.

To address the challenges of detecting small obstacles and of model portability, this study proposes Light-YOLO, a lightweight, high-precision and high-speed network for detecting small obstacles in nighttime road environments. First, the SPDConvMobileNetV3 feature extraction network is introduced, which significantly reduces the total number of parameters while enhancing the ability to capture small-obstacle details. Next, to make the network focus more on small obstacles under nighttime conditions, the Wise-IoU loss function is incorporated, which is better suited to low-quality images. Finally, to improve overall performance without increasing the total number of parameters, the parameter-free attention mechanism SimAM is integrated. Experiments on publicly available data and a self-built dataset show that Light-YOLO achieves a mean average precision (mAP50) of 97.1% while maintaining a high image-processing speed. Compared with other advanced models in the same series, Light-YOLO also has fewer parameters, a smaller computational load (GFLOPs) and a smaller model weight (Best.pt). Overall, Light-YOLO strikes a balance between lightweight design, accuracy and speed, making it well suited to hardware-constrained devices.

GraPLUS: Graph-based Placement Using Semantics for image composition
Mir Mohammad Khaleghi, Mehran Safayani, Abdolreza Mirzaei
Computer Vision and Image Understanding, vol. 259, Article 104427. DOI: 10.1016/j.cviu.2025.104427. Published 20 June 2025.

We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling a nuanced understanding of object relationships and placement patterns. GraPLUS achieves a placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.3% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 38 participants, our method was preferred in 51.8% of cases, significantly outperforming previous approaches (25.8% and 22.4% for the next best methods). The framework's key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, eliminating the need to train feature extraction parameters from scratch; (ii) edge-aware graph neural networks that process scene semantics through structured relationships; (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features; and (iv) a multiobjective training strategy incorporating semantic consistency constraints. Extensive experiments demonstrate GraPLUS's superior performance in both placement plausibility and spatial precision, with particular strengths in maintaining object proportions and contextual relationships across diverse scene types.