Information Fusion. Pub Date: 2025-05-14. DOI: 10.1016/j.inffus.2025.103277
Qi Li, Bojian Chen, Qitong Chen, Xuan Li, Zhaoye Qin, Fulei Chu
{"title":"HSE: A plug-and-play module for unified fault diagnosis foundation models","authors":"Qi Li , Bojian Chen , Qitong Chen , Xuan Li , Zhaoye Qin , Fulei Chu","doi":"10.1016/j.inffus.2025.103277","DOIUrl":"10.1016/j.inffus.2025.103277","url":null,"abstract":"<div><div>Intelligent Fault Diagnosis (IFD) plays a crucial role in industrial applications, where developing foundation models analogous to ChatGPT for comprehensive fault diagnosis remains a significant challenge. Current IFD methodologies are constrained by their inability to construct unified models capable of processing heterogeneous signal types, varying sampling rates, and diverse signal lengths across different equipment. To address these limitations, we propose a novel Heterogeneous Signal Embedding (HSE) module that projects heterogeneous signals into a unified signal space, offering seamless integration with existing IFD architectures as a plug-and-play solution. The HSE framework comprises two primary components: the Temporal-Aware Patching (TAP) module for embedding heterogeneous signals into a unified space, and the Cross-Dimensional Patch Fusion (CDPF) module for fusing embedded signals with temporal information into unified representations. We validate the efficacy of HSE through two comprehensive case studies: a simulation signal dataset and three distinct bearing datasets with heterogeneous features. Our experimental results demonstrate that HSE significantly enhances traditional fault diagnosis models, improving both diagnostic accuracy and generalization capability. While conventional approaches necessitate separate models for specific signal types, sampling frequencies, and signal lengths, HSE-enabled architectures successfully learn unified representations across diverse signal. The results from bearing fault diagnosis applications confirm substantial improvements in both diagnostic precision and cross-dataset generalization. As a pioneering contribution toward IFD foundation models, the proposed HSE framework establishes a fundamental architecture for advancing unified fault diagnosis systems.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103277"},"PeriodicalIF":14.7,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144099099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information Fusion. Pub Date: 2025-05-13. DOI: 10.1016/j.inffus.2025.103282
Shenlu Zhao, Jingyi Wang, Qiang Zhang, Jungong Han
{"title":"Towards efficient RGB-T semantic segmentation via feature generative distillation strategy","authors":"Shenlu Zhao , Jingyi Wang , Qiang Zhang , Jungong Han","doi":"10.1016/j.inffus.2025.103282","DOIUrl":"10.1016/j.inffus.2025.103282","url":null,"abstract":"<div><div>Recently, multimodal knowledge distillation-based methods for RGB-T semantic segmentation have been developed to enhance segmentation performance and inference speeds. Technically, the crux of these models lies in the feature imitative distillation-based strategies, where the student models imitate the working principles of the teacher models through loss functions. Unfortunately, due to the significant gaps in the representation capability between the student and teacher models, such feature imitative distillation-based strategies may not achieve the anticipatory knowledge transfer performance in an efficient way. In this paper, we propose a novel feature generative distillation strategy for efficient RGB-T semantic segmentation, embodied in the Feature Generative Distillation-based Network (FGDNet), which includes a teacher model (FGDNet-T) and a student model (FGDNet-S). This strategy bridges the gaps between multimodal feature extraction and complementary information excavation by using Conditional Variational Auto-Encoder (CVAE) to generate teacher features from student features. Additionally, Multimodal Complementarity Separation modules (MCS-L and MCS-H) are introduced to separate complementary features at different levels. Comprehensive experimental results on four public benchmarks demonstrate that, compared with mainstream RGB-T semantic segmentation methods, our FGDNet-S achieves competitive segmentation performance with lower number of parameters and computational complexity.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103282"},"PeriodicalIF":14.7,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144068226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information Fusion. Pub Date: 2025-05-13. DOI: 10.1016/j.inffus.2025.103278
Zhong Chen, Xiaolei Zhang, Xueru Xu, Hanruo Chen, Xiaofei Mi, Jian Yang
{"title":"Registration-aware cross-modal interaction network for optical and SAR images","authors":"Zhong Chen , Xiaolei Zhang , Xueru Xu , Hanruo Chen , Xiaofei Mi , Jian Yang","doi":"10.1016/j.inffus.2025.103278","DOIUrl":"10.1016/j.inffus.2025.103278","url":null,"abstract":"<div><div>The registration of optical and synthetic aperture radar (SAR) images is valuable for exploration due to the inherent complementarity of optical and SAR imagery. However, the substantial radiation and geometric differences between the two modalities present a major obstacle to image registration. Specifically, images from optical and SAR require integration of precise local features and registration-aware global features, and features within and across modalities need to be interacted with efficiently to achieve accurate registration. To tackle this problem, we build a Robust Quadratic Net (RQ-Net) based on the paradigm of describe-then-detect, which is of dual-encoder–decoder design, the first encoder is responsible for encoding local features within each modality through vanilla convolutional operators, while the other is an elaborated Multilayer Cross-modal Registration-aware (MCR) encoder specialized in building global relationships both inner- and inter-modalities, which is conducted effectively at various scales to extract informative features for registration. Furthermore, to cooperate with the network’s training for more well-suited registration feature descriptors, we propose a reconsider loss to review whether the least similar positive feature pairs are matchable and make the RQ-Net achieve a higher matching capability. Through extensive qualitative and quantitative experiments on three paired optical and SAR datasets, RQ-Net has been validated as superior in extracting sufficient features for matching and improving image success registration rates while maintaining low registration errors.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103278"},"PeriodicalIF":14.7,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143941126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information Fusion. Pub Date: 2025-05-13. DOI: 10.1016/j.inffus.2025.103276
Jindou Zhang, Ruiqian Zhang, Xiao Huang, Zhizheng Zhang, Bowen Cai, Xianwei Lv, Zhenfeng Shao, Deren Li
{"title":"Joint content-aware and difference-transform lightweight network for remote sensing images semantic change detection","authors":"Jindou Zhang , Ruiqian Zhang , Xiao Huang , Zhizheng Zhang , Bowen Cai , Xianwei Lv , Zhenfeng Shao , Deren Li","doi":"10.1016/j.inffus.2025.103276","DOIUrl":"10.1016/j.inffus.2025.103276","url":null,"abstract":"<div><div>Advancements in Earth observation technology have enabled effective monitoring of complex surface changes. Semantic change detection (SCD) using high-resolution remote sensing images is crucial for urban planning and environmental monitoring. However, existing deep learning-based SCD methods, which combine semantic segmentation (SS) and binary change detection (BCD), face challenges in lightweight design and consistency between semantic and change results, limiting their accuracy and applicability. To overcome these limitations, we propose the Joint Content-Aware and Difference-Transform Lightweight Network (CDLNet). CDLNet features a lightweight architecture, skip connections, and a multi-task decoding mechanism. The Temporal-Spatial Content-Aware Fusion module (TSAF) in the SS decoding branch incorporates change information to improve semantic classification accuracy within change regions. The Multi-Type Temporal Difference-Transform module (MTDT) in the BCD decoding branch enhances change localization for accurate SCD through efficient transformation of temporal difference features. Experiments on the SECOND, HiUCD mini, MSSCD, and Landsat-SCD datasets demonstrate that CDLNet outperforms thirteen state-of-the-art methods, achieving average improvements of 1.41%, 1.53% and 1.49% in the <span><math><mrow><mi>F</mi><msub><mrow><mn>1</mn></mrow><mrow><mi>s</mi><mi>c</mi><mi>d</mi></mrow></msub></mrow></math></span>, <span><math><mrow><mi>I</mi><mi>o</mi><mi>U</mi><mi>c</mi></mrow></math></span> and <span><math><mrow><mi>S</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi>e</mi></mrow></math></span> metrics, respectively. Ablation studies confirm the effectiveness of the TSAF and MTDT modules and the rationality of multi-task loss weight configuration. Furthermore, CDLNet utilizes only 20% of the parameters (12.88M) and 7.5% of the FLOPs (30.11G) of the leading model, achieving an inference speed of 41 FPS, which underscores its superior lightweight characteristics. The results indicate that CDLNet offers excellent detection performance, generalization, and robustness within a lightweight framework. The code of our paper is accessible at: <span><span>https://github.com/zjd1836/CDLNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103276"},"PeriodicalIF":14.7,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144068227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information Fusion. Pub Date: 2025-05-12. DOI: 10.1016/j.inffus.2025.103279
Yingxiao Qiao, Qian Zhao
{"title":"A self-supervised data augmentation strategy for EEG-based emotion recognition","authors":"Yingxiao Qiao, Qian Zhao","doi":"10.1016/j.inffus.2025.103279","DOIUrl":"10.1016/j.inffus.2025.103279","url":null,"abstract":"<div><div>Due to the scarcity problem of electroencephalogram (EEG) data, building high-precision emotion recognition models using deep learning faces great challenges. In recent years, data augmentation has significantly enhanced deep learning performance. Therefore, this paper proposed an innovative self-supervised data augmentation strategy, named SSDAS-EER, to generate high-quality and various artificial EEG feature maps. Firstly, EEG feature maps were constructed by combining differential entropy (DE) and power spectral density (PSD) features to obtain rich spatial and spectral information. Secondly, a masking strategy was used to mask part of the EEG feature maps, which prompted the designed generative adversarial network (GAN) to focus on learning the unmasked feature information and effectively filled in the masked parts. Meanwhile, the elaborated GAN could accurately capture the distribution characteristics of spatial and spectral information, thus ensuring the quality of the generated artificial EEG feature maps. In particular, this paper introduced a self-supervised learning mechanism to further optimize the designed classifier with good generalization ability to the generated samples. This strategy integrated data augmentation and model training into an end-to-end pipeline capable of augmenting EEG data for each subject. In this study, a systematic experiment was conducted on the DEAP dataset, and the results showed that the proposed method achieved an average accuracy of 97.27% and 97.45% on all subjects in valence and arousal, respectively, which was 1.46% and 1.39% higher compared to the time before the strategy was applied. Simultaneously, the similarity between the generated EEG feature maps and the original EEG feature maps was verified. These results indicated that SSDAS-EER had significant performance improvement in EEG emotion recognition tasks, demonstrating its great potential in improving the efficiency of EEG data utilization and emotion recognition accuracy.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103279"},"PeriodicalIF":14.7,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144084094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information Fusion. Pub Date: 2025-05-12. DOI: 10.1016/j.inffus.2025.103281
Yuqun Yang, Jichen Xu, Mengyuan Xu, Xu Tang, Bo Wang, Kechen Shu, Zheng You
{"title":"FSVS-Net: A few-shot semi-supervised vessel segmentation network for multiple organs based on feature distillation and bidirectional weighted fusion","authors":"Yuqun Yang , Jichen Xu , Mengyuan Xu , Xu Tang , Bo Wang , Kechen Shu , Zheng You","doi":"10.1016/j.inffus.2025.103281","DOIUrl":"10.1016/j.inffus.2025.103281","url":null,"abstract":"<div><div>Accurate 3D vessel mapping is essential for surgical planning and interventional treatments. However, the conventional manual slice-by-slice annotation in CT scans is extremely time-consuming, due to the complexity of vessels: sparse distribution, intricate 3D topology, varying sizes, irregular shapes, and low contrast with the background. To address this problem, we propose a few-shot semi-supervised vessel segmentation network (FSVS-Net) applicable to multiple organs. It can leverage a few annotated slices to segment vessel regions in unannotated slices, enabling efficient semi-supervised processing of the entire CT sequences. Specifically, we propose a feature distillation module for FSVS-Net to enhance vessel-specific semantic representations and suppress irrelevant background features. In addition, we design a bidirectional weighted fusion strategy that propagates information from a few annotated slices to unannotated ones in both opposite directions of the CT sequence, effectively modeling 3D vessel continuity and reducing error accumulation. Extensive experiments on three datasets (hepatic vessels, pulmonary vessels, and renal arteries) demonstrate that FSVS-Net achieves state-of-the-art performance in few-shot vessel segmentation task, significantly outperforming existing methods. We collected and annotated three vessel datasets, including clinical data from Tsinghua Changgung Hospital and public sources (e.g., MSD08), for this study. In practice, it reduces the average annotation time from 2 h to 0.5 h per volume, improving efficiency by 4<span><math><mo>×</mo></math></span>. We release three organ-specific vessel datasets and the implementation code at: <span><span>https://github.com/YqunYang/FSVS-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103281"},"PeriodicalIF":14.7,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144071140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information Fusion. Pub Date: 2025-05-11. DOI: 10.1016/j.inffus.2025.103274
Yifeng Wang, Yi Zhao
{"title":"General pre-trained inertial signal feature extraction based on temporal memory fusion","authors":"Yifeng Wang, Yi Zhao","doi":"10.1016/j.inffus.2025.103274","DOIUrl":"10.1016/j.inffus.2025.103274","url":null,"abstract":"<div><div>Inertial sensors are widely used in smartphones, robotics, wearables, aerospace systems, and industrial automation. However, extracting universal features from inertial signals remains challenging. Inertial signal features are encoded in abstract, unreadable waveforms, lacking the visual intuitiveness of images, which makes semantic extraction difficult. The non-stationary nature and complex motion patterns further complicate the feature extraction process. Moreover, the lack of large-scale annotated inertial datasets limits deep learning models to learn universal features and generalize them across expansive applications of inertial sensors. To this end, we propose a Topology Guided Feature Extraction (TG-FE) approach for general inertial signal feature extraction. TG-FE fuses time-series information into graph representations, constructing a Memory Graph by emulating the complex network characteristics of human memory. Guided by small-world network principles, this graph integrates local and global information while sparsity constraints emphasize critical feature interactions. The Memory Graph preserves nonlinear relationships and higher-order dependencies, enabling the model to generalize across scenarios with minimal task-specific tuning. Furthermore, a Cross-Graph Feature Fusion mechanism integrates information across stacked TG-FE modules to enhance representation ability and ensure stable gradient flow. With self-supervised pre-training, the TG-FE modules require only minimal fine-tuning to adapt to various hardware configurations and task scenarios, consistently outperforming comparison methods across all evaluations. Compared to the current state-of-the-art method, our TG-FE achieves 11.7% and 20.0% error reduction in attitude and displacement estimation tasks. Notably, TG-FE achieves an order-of-magnitude advantage in stability evaluations, maintaining robust performance even under 20% noise conditions where competing methods degrade significantly. Overall, this work offers a solution for general inertial signal feature extraction and opens new avenues for applying graph-based deep learning to capture and represent sequential signal features.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103274"},"PeriodicalIF":14.7,"publicationDate":"2025-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143936280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory recall: Retrieval-Augmented mind reconstruction for brain decoding","authors":"Yuxiao Zhao , Guohua Dong , Lei Zhu , Xiaomin Ying","doi":"10.1016/j.inffus.2025.103280","DOIUrl":"10.1016/j.inffus.2025.103280","url":null,"abstract":"<div><div>Reconstructing visual stimuli from functional magnetic resonance imaging (fMRI) is a complex challenge in neuroscience. Most existing approaches rely on mapping neural signals to pretrained models to generate latent variables, which are then used to reconstruct images via a diffusion model. However, this multi-step process can result in the loss of crucial semantic details, limiting reconstruction accuracy. In this paper, we introduce a novel brain decoding framework, called Memory Recall (MR), inspired by bionic brain mechanisms. MR mimics the human visual perception process, where the brain retrieves stored visual experiences to compensate for incomplete visual cues. Initially, low- and high-level visual cues are extracted using spatial mapping techniques based on VAE and CLIP, replicating the brain’s ability to interpret degraded stimuli. A visual experience database is then created to retrieve complementary information that enriches these high-level representations, simulating the brain’s memory retrieval process. Finally, an Attentive Visual Signal Fusion Network (AVSFN) with a novel attention scoring mechanism integrates the retrieved information, enhancing the generative model’s performance and emulating the brain’s refinement of visual perception. Experimental results show that MR outperforms state-of-the-art models across multiple evaluation metrics and subjective assessments. Moreover, our results provide new evidence supporting a well-known psychological conclusion that the basic information capacity of short-term memory is four items, further demonstrating the informativeness and interpretability of our model.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103280"},"PeriodicalIF":14.7,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143936278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-supervised representation learning for geospatial objects: A survey","authors":"Yile Chen , Weiming Huang , Kaiqi Zhao , Yue Jiang , Gao Cong","doi":"10.1016/j.inffus.2025.103265","DOIUrl":"10.1016/j.inffus.2025.103265","url":null,"abstract":"<div><div>The proliferation of various data sources in urban and territorial environments has significantly facilitated the development of geospatial artificial intelligence (GeoAI) across a wide range of geospatial applications. However, geospatial data, which is inherently linked to geospatial objects, often exhibits data heterogeneity that necessitates specialized fusion and representation strategies while simultaneously being inherently sparse in labels for downstream tasks. Consequently, there is a growing demand for techniques that can effectively leverage geospatial data without heavy reliance on task-specific labels and model designs. This need aligns with the principles of self-supervised learning (SSL), which has garnered increasing attention for its ability to learn effective and generalizable representations directly from data without extensive labeled supervision. This paper presents a comprehensive and up-to-date survey of SSL techniques specifically applied to or developed for geospatial objects in three primary vector geometric types: <em>Point</em>, <em>Polyline</em>, and <em>Polygon</em>. We systematically categorize various SSL techniques into predictive and contrastive methods, and analyze their adaptation to different data types for representation learning across various downstream tasks. Furthermore, we examine the emerging trends in SSL for geospatial objects, particularly the gradual advancements towards geospatial foundation models. Finally, we discuss key challenges in current research and outline promising directions for future investigation. By offering a structured analysis of existing studies, this paper aims to inspire continued progress in integrating SSL with geospatial objects, and the development of geospatial foundation models in a longer term.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103265"},"PeriodicalIF":14.7,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143936277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information Fusion. Pub Date: 2025-05-09. DOI: 10.1016/j.inffus.2025.103303
Tahir Mahmood, Ganbayar Batchuluun, Seung Gu Kim, Jung Soo Kim, Kang Ryoung Park
{"title":"A lightweight hierarchical feature fusion network for surgical instrument segmentation in internet of medical things","authors":"Tahir Mahmood, Ganbayar Batchuluun, Seung Gu Kim, Jung Soo Kim, Kang Ryoung Park","doi":"10.1016/j.inffus.2025.103303","DOIUrl":"10.1016/j.inffus.2025.103303","url":null,"abstract":"<div><div>Minimally invasive surgeries (MIS) enhance patient outcomes but pose challenges such as limited visibility, complex hand-eye coordination, and manual endoscope control. The rise of the Internet of Medical Things (IoMT) and telesurgery further demands efficient and lightweight solutions. To address these limitations, we propose a novel lightweight hierarchical feature fusion network (LHFF-Net) for surgical instrument segmentation. LHFF-Net integrates high-, mid-, and low-level encoder features through three novel modules: the multiscale feature aggregation (MFA) module which can capture fine-grained and coarse features across scales, the enhanced spatial attention (ESA) module, prioritizing critical spatial regions, and the enhanced edge module (EEM), refining boundary delineation.</div><div>The proposed model was evaluated on two benchmark datasets, Kvasir-Instrument and UW-Sinus-Surgery, achieving mean Dice coefficients (mDC) of 97.87 % and 88.83 %, respectively, along with mean intersection over union (mIOU) scores of 95.87 % and 84.33 %. These results highlight LHFF-Net’s ability to deliver high segmentation accuracy while maintaining computational efficiency with only 2.2 million parameters. This combination of performance and efficiency makes LHFF-Net a robust solution for IoMT applications, enabling real-time telesurgery and driving innovations in healthcare.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103303"},"PeriodicalIF":14.7,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143936279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}