Guodong Fan , Shengning Zhou , Zhen Hua , Jinjiang Li , Jingchun Zhou
{"title":"基于llava的水下图像增强语义特征调制扩散模型","authors":"Guodong Fan , Shengning Zhou , Zhen Hua , Jinjiang Li , Jingchun Zhou","doi":"10.1016/j.inffus.2025.103566","DOIUrl":null,"url":null,"abstract":"<div><div>Underwater Image Enhancement (UIE) is critical for numerous marine applications; however, existing methods often fall short in addressing severe color distortion, detail loss, and lack of semantic understanding, particularly under spatially varying degradation conditions. While Generative AI (GenAI), particularly diffusion models and multimodal large language models (MLLMs), offers new prospects for UIE, effectively leveraging their capabilities for fine-grained, semantic-aware enhancement remains a challenge. We proposed a LLaVA-based semantic feature modulation diffusion model (LSFM-Diff), which integrates multi-level semantic guidance collaboratively into the backbone network of the diffusion model. Specifically, an optimized prompt learning strategy is first employed to obtain concise, UIE-relevant textual descriptions from LLaVA. These semantics then guide the enhancement process in two key stages: (1) The windowed text-image fusion for condition refinement (WTIF-CR) module aligns and fuses textual semantics with local image features spatially, generating fine-grained external conditions that provide an initial spatially aware semantic blueprint for the diffusion model. (2) The semantic-guided deformable attention (SGDA) mechanism, leveraging a gradient-based image-text interaction to generate a semantic navigation map, guides the attention within the denoising network to focus on key semantic regions. Experiments conducted on several challenging benchmark datasets demonstrate that LSFM-Diff outperforms current state-of-the-art methods. 
Our work highlights the effectiveness of deep integration of multi-level semantic guidance fusion strategies in advancing GenAI-based UIE development.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"126 ","pages":"Article 103566"},"PeriodicalIF":15.5000,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LLaVA-based semantic feature modulation diffusion model for underwater image enhancement\",\"authors\":\"Guodong Fan , Shengning Zhou , Zhen Hua , Jinjiang Li , Jingchun Zhou\",\"doi\":\"10.1016/j.inffus.2025.103566\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Underwater Image Enhancement (UIE) is critical for numerous marine applications; however, existing methods often fall short in addressing severe color distortion, detail loss, and lack of semantic understanding, particularly under spatially varying degradation conditions. While Generative AI (GenAI), particularly diffusion models and multimodal large language models (MLLMs), offers new prospects for UIE, effectively leveraging their capabilities for fine-grained, semantic-aware enhancement remains a challenge. We proposed a LLaVA-based semantic feature modulation diffusion model (LSFM-Diff), which integrates multi-level semantic guidance collaboratively into the backbone network of the diffusion model. Specifically, an optimized prompt learning strategy is first employed to obtain concise, UIE-relevant textual descriptions from LLaVA. These semantics then guide the enhancement process in two key stages: (1) The windowed text-image fusion for condition refinement (WTIF-CR) module aligns and fuses textual semantics with local image features spatially, generating fine-grained external conditions that provide an initial spatially aware semantic blueprint for the diffusion model. 
(2) The semantic-guided deformable attention (SGDA) mechanism, leveraging a gradient-based image-text interaction to generate a semantic navigation map, guides the attention within the denoising network to focus on key semantic regions. Experiments conducted on several challenging benchmark datasets demonstrate that LSFM-Diff outperforms current state-of-the-art methods. Our work highlights the effectiveness of deep integration of multi-level semantic guidance fusion strategies in advancing GenAI-based UIE development.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"126 \",\"pages\":\"Article 103566\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-07-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525006384\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525006384","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
LLaVA-based semantic feature modulation diffusion model for underwater image enhancement
Underwater Image Enhancement (UIE) is critical for numerous marine applications; however, existing methods often fall short in addressing severe color distortion, detail loss, and lack of semantic understanding, particularly under spatially varying degradation conditions. While Generative AI (GenAI), particularly diffusion models and multimodal large language models (MLLMs), offers new prospects for UIE, effectively leveraging their capabilities for fine-grained, semantic-aware enhancement remains a challenge. We propose an LLaVA-based semantic feature modulation diffusion model (LSFM-Diff), which collaboratively integrates multi-level semantic guidance into the backbone network of the diffusion model. Specifically, an optimized prompt learning strategy is first employed to obtain concise, UIE-relevant textual descriptions from LLaVA. These semantics then guide the enhancement process in two key stages: (1) The windowed text-image fusion for condition refinement (WTIF-CR) module spatially aligns and fuses textual semantics with local image features, generating fine-grained external conditions that provide an initial spatially aware semantic blueprint for the diffusion model. (2) The semantic-guided deformable attention (SGDA) mechanism leverages a gradient-based image-text interaction to generate a semantic navigation map, guiding the attention within the denoising network to focus on key semantic regions. Experiments conducted on several challenging benchmark datasets demonstrate that LSFM-Diff outperforms current state-of-the-art methods. Our work highlights the effectiveness of deeply integrated multi-level semantic guidance fusion strategies in advancing GenAI-based UIE development.
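The abstract's two-stage guidance can be illustrated with a minimal toy sketch. The function and variable names below are hypothetical, and the operations (cosine-similarity gating per window, a per-pixel text-image similarity map) are drastic simplifications standing in for the paper's actual WTIF-CR and SGDA mechanisms, which are not specified in the abstract; this only shows the general shape of text-conditioned spatial modulation.

```python
import numpy as np

def windowed_text_image_fusion(img_feats, text_emb, window=4):
    """Toy stand-in for WTIF-CR: gate each local window of image
    features by its cosine similarity to a text embedding, then add
    the gated text embedding as a spatially varying condition."""
    H, W, _ = img_feats.shape
    fused = img_feats.copy()
    for i in range(0, H, window):
        for j in range(0, W, window):
            patch = fused[i:i + window, j:j + window]        # (w, w, C)
            mean = patch.mean(axis=(0, 1))                   # (C,)
            sim = float(mean @ text_emb) / (
                np.linalg.norm(mean) * np.linalg.norm(text_emb) + 1e-8)
            gate = 1.0 / (1.0 + np.exp(-sim))                # sigmoid gate
            fused[i:i + window, j:j + window] = patch + gate * text_emb
    return fused

def semantic_navigation_map(img_feats, text_emb):
    """Toy stand-in for SGDA's navigation map: per-pixel text-image
    similarity, min-max normalized to [0, 1], which could then weight
    attention toward key semantic regions."""
    sim = img_feats @ text_emb                               # (H, W)
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)

# Example with random features in place of real encoder outputs.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 16)).astype(np.float32)
text = rng.standard_normal(16).astype(np.float32)
fused = windowed_text_image_fusion(feats, text)
nav = semantic_navigation_map(feats, text)
```

In the real model these conditions would feed the diffusion denoiser; here they are just arrays of the expected shapes.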
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.