Implicit and Explicit Language Guidance for Diffusion-Based Visual Perception

IF 8.4 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Information Systems)
Hefeng Wang, Jiale Cao, Jin Xie, Aiping Yang, Yanwei Pang
{"title":"Implicit and Explicit Language Guidance for Diffusion-Based Visual Perception","authors":"Hefeng Wang;Jiale Cao;Jin Xie;Aiping Yang;Yanwei Pang","doi":"10.1109/TMM.2024.3521825","DOIUrl":null,"url":null,"abstract":"Text-to-image diffusion models have shown powerful ability on conditional image synthesis. With large-scale vision-language pre-training, diffusion models are able to generate high-quality images with rich textures and reasonable structures under different text prompts. However, adapting pre-trained diffusion models for visual perception is an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based visual perception, named IEDP. Our IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model without explicit text prompts. The explicit branch uses the ground-truth labels of corresponding images as text prompts to condition feature extraction in diffusion model. During training, we jointly train the diffusion model by sharing the model weights of these two branches. As a result, the implicit and explicit branches can jointly guide feature learning. During inference, we employ only implicit branch for final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks, including semantic segmentation and depth estimation. Our IEDP achieves promising performance on both tasks. For semantic segmentation, our IEDP has the mIoU<inline-formula><tex-math>$^\\text{ss}$</tex-math></inline-formula> score of 55.9% on ADE20K validation set, which outperforms the baseline method VPD by 2.2%. For depth estimation, our IEDP outperforms the baseline method VPD with a relative gain of 11.0%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"466-476"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814050/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Text-to-image diffusion models have shown a powerful ability in conditional image synthesis. With large-scale vision-language pre-training, diffusion models can generate high-quality images with rich textures and reasonable structures under different text prompts. However, adapting pre-trained diffusion models to visual perception remains an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based visual perception, named IEDP. Our IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model, without explicit text prompts. The explicit branch uses the ground-truth labels of the corresponding images as text prompts to condition feature extraction in the diffusion model. During training, we jointly train the diffusion model by sharing the model weights between these two branches. As a result, the implicit and explicit branches jointly guide feature learning. During inference, we employ only the implicit branch for final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks: semantic segmentation and depth estimation. Our IEDP achieves promising performance on both tasks. For semantic segmentation, IEDP attains an mIoU$^\text{ss}$ score of 55.9% on the ADE20K validation set, outperforming the baseline method VPD by 2.2%. For depth estimation, IEDP outperforms VPD with a relative gain of 11.0%.
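The abstract describes a dual-branch conditioning scheme with a weight-shared backbone. The following is a minimal toy sketch of that flow, based only on the abstract: the class `IEDPSketch` and its stub modules (`clip_image_encoder`, `text_encoder`, `backbone`, `head`), all tensor shapes, the vocabulary size, and the label-to-token mapping are illustrative assumptions, not the authors' actual implementation, which uses a real frozen CLIP encoder and a pre-trained diffusion UNet.

```python
import torch
import torch.nn as nn

class IEDPSketch(nn.Module):
    """Toy stand-in for IEDP's two conditioning branches (hypothetical)."""

    def __init__(self, embed_dim=768, num_classes=150, vocab=1000):
        super().__init__()
        # Frozen stand-in for the CLIP image encoder (implicit branch).
        self.clip_image_encoder = nn.Linear(3 * 32 * 32, embed_dim)
        for p in self.clip_image_encoder.parameters():
            p.requires_grad = False
        # Stand-in text encoder mapping label tokens to embeddings (explicit branch).
        self.text_encoder = nn.Embedding(vocab, embed_dim)
        # Stand-ins for the diffusion backbone and the task head; the
        # backbone weights are shared by both branches.
        self.backbone = nn.Linear(3 * 32 * 32 + embed_dim, embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def extract(self, image, cond):
        # Both branches pass through the SAME backbone weights, mirroring
        # the joint-training scheme described in the abstract.
        feat = torch.relu(self.backbone(torch.cat([image.flatten(1), cond], dim=1)))
        return self.head(feat)

    def forward(self, image, label_ids=None):
        # Implicit branch: condition on image-derived embeddings; no prompt.
        implicit = self.extract(image, self.clip_image_encoder(image.flatten(1)))
        if label_ids is None:
            return implicit  # inference uses the implicit branch only
        # Explicit branch (training only): ground-truth label names as prompts.
        explicit = self.extract(image, self.text_encoder(label_ids).mean(dim=1))
        return implicit, explicit

model = IEDPSketch()
img = torch.randn(2, 3, 32, 32)
label_ids = torch.randint(0, 1000, (2, 4))  # hypothetical tokenized class names
train_out = model(img, label_ids)           # training: both branch outputs supervised
pred = model(img)                           # inference: no ground truth needed
```

Under this reading, both branch outputs would be supervised against the same ground truth during training, so the shared backbone benefits from both conditioning signals, while the label-free implicit path suffices at test time.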
Source Journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.