Multi-aspect fusion in foundational large vision model for visible light medical imaging segmentation

Xingru Huang, Tianyun Zhang, Zhaoyang Xu, Jian Huang, Gaopeng Huang, Han Yang, Binfeng Zou, Shouqin Ding, Renjie Ruan, Zhao Huang, Huiyu Zhou, Jin Liu, Zhiwen Zheng, Shaowei Jiang, Xiaoshuai Zhang

Information Fusion, Volume 125, Article 103385. Published 2025-07-11. DOI: 10.1016/j.inffus.2025.103385

Abstract

Accurate lesion quantification is a critical component of precision diagnostics and targeted therapeutic strategies, yet current methodologies face challenges from the diverse contexts and complicated structures inherent in visible-light medical imaging, including semantic ambiguity, noise interference, and geometric complexity, which collectively hinder segmentation accuracy. To address these challenges, we propose the Multi-Aspect Large Vision Model (MasLVM), a foundational model for optical medical imaging that achieves comprehensive feature fusion through a Tri-Path design. The Semantic Context Encoder (SCE) integrates a pre-trained large vision model with global semantic embeddings to improve contextual abstraction and mitigate semantic ambiguity. The Spectral Spline Encoder (SSE), incorporating the Multi-Frequency Feature Modulator (MFFM) and Kolmogorov–Arnold Networks (KAN) Channel Attention, transforms image representations into the frequency domain to selectively attenuate noise while preserving essential structural features. The Hierarchical Deformable Morphometry Encoder (HDME) employs deformable convolutions and multi-scale encoding to dynamically capture heterogeneous geometric structures. The outputs of these branches are synthesized by the Multi-Attention KAN Decoder, which employs KAN multiple self-attention and iterative attentional fusion to adaptively select and enhance critical semantic, spectral, and morphological domain features. Extensive experiments on six widely recognized datasets demonstrate that MasLVM achieves convincing performance compared with multiple previous state-of-the-art (SoTA) methods, as well as potential utility in adapting to the diverse requirements of visible-light medical imaging tasks under constrained conditions. The code and model weights can be used directly for medical task deployment or fine-tuning, and are publicly available at https://github.com/IMOP-lab/MasLVM-Pytorch.
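The SSE's frequency-domain route can be pictured with a minimal NumPy sketch. This is illustrative only, not the paper's MFFM or KAN Channel Attention: transform the image to the frequency domain, attenuate (rather than zero) components beyond a cutoff radius, and transform back, which suppresses broadband noise while retaining low-frequency structure. The function name and parameters below are hypothetical.

```python
import numpy as np

def lowpass_attenuate(image: np.ndarray, cutoff: float = 0.25, keep: float = 0.1) -> np.ndarray:
    """Attenuate high-frequency components of a 2-D image via the FFT.

    Frequencies beyond `cutoff` (a fraction of the half-spectrum radius) are
    scaled by `keep` instead of being zeroed, preserving some fine structure.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Radial mask centred on the zero-frequency component after fftshift.
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = np.where(radius <= cutoff * min(h, w) / 2, 1.0, keep)
    # Apply the mask in the frequency domain, then invert the transform.
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

# Toy example: a smooth gradient corrupted by high-frequency noise.
rng = np.random.default_rng(0)
clean = np.linspace(0.0, 1.0, 64)[None, :] * np.ones((64, 1))
noisy = clean + 0.3 * rng.standard_normal((64, 64))
denoised = lowpass_attenuate(noisy)
```

Because white noise spreads its energy uniformly across the spectrum while the gradient concentrates its energy at low frequencies, the attenuated reconstruction sits closer to the clean signal than the noisy input does.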
Citations: 0
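The HDME pairs deformable convolutions with multi-scale encoding; the deformable half needs a deep-learning framework, but the multi-scale half can be pictured as a simple average-pooling pyramid. The sketch below is a toy NumPy illustration under that assumption, not the paper's encoder; the function name is hypothetical.

```python
import numpy as np

def multiscale_pyramid(feature: np.ndarray, levels: int = 3) -> list[np.ndarray]:
    """Build a multi-scale pyramid by repeated 2x2 average pooling.

    Each level halves both spatial dimensions, giving the coarser views a
    hierarchical encoder would process alongside the full-resolution map.
    """
    pyramid = [feature]
    current = feature
    for _ in range(levels - 1):
        h, w = current.shape
        # Group pixels into 2x2 blocks and average within each block.
        current = current.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(current)
    return pyramid
```

Average pooling preserves the global mean at every level, so each scale is a faithful low-resolution summary of the one above it.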
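The decoder's iterative attentional fusion of the three branch outputs can be caricatured as a loop that re-weights each branch by its agreement with the current fused estimate. This NumPy sketch is a hypothetical simplification, not the Multi-Attention KAN Decoder; the scoring rule and function name are assumptions for illustration.

```python
import numpy as np

def attentional_fuse(branches: list[np.ndarray], iterations: int = 2) -> np.ndarray:
    """Fuse per-branch feature maps with iteratively refined softmax weights.

    Each branch (semantic, spectral, morphological in the paper's terminology)
    receives a scalar gate per iteration based on its agreement with the
    current fused estimate; gates are renormalised with a softmax.
    """
    fused = np.mean(branches, axis=0)  # initial estimate: plain average
    for _ in range(iterations):
        # Score each branch by negative mean-squared distance to the fusion.
        scores = np.array([-np.mean((b - fused) ** 2) for b in branches])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # Convex combination: the fusion stays within the branch value range.
        fused = sum(w * b for w, b in zip(weights, branches))
    return fused
```

Because the weights always form a convex combination, the fused map is bounded elementwise by the branch extremes, which keeps the iteration stable.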
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.