VLDadaptor: Domain Adaptive Object Detection With Vision-Language Model Distillation

IF 8.4 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Information Systems)
Junjie Ke;Lihuo He;Bo Han;Jie Li;Di Wang;Xinbo Gao
{"title":"VLDadaptor: Domain Adaptive Object Detection With Vision-Language Model Distillation","authors":"Junjie Ke;Lihuo He;Bo Han;Jie Li;Di Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3453061","DOIUrl":null,"url":null,"abstract":"Domain adaptive object detection (DAOD) aims to develop a detector trained on labeled source domains to identify objects in unlabeled target domains. A primary challenge in DAOD is the domain shift problem. Most existing methods learn domain-invariant features within single domain embedding space, often resulting in heavy model biases due to the intrinsic data properties of source domains. To mitigate the model biases, this paper proposes VLDadaptor, a domain adaptive object detector based on vision-language models (VLMs) distillation. Firstly, the proposed method integrates domain-mixed contrastive knowledge distillation between the visual encoder of CLIP and the detector by transferring category-level instance features, which guarantees the detector can extract domain-invariant visual instance features across domains. Then, VLDadaptor employs domain-mixed consistency distillation between the text encoder of CLIP and detector by aligning text prompt embeddings with visual instance features, which helps to maintain the category-level feature consistency among the detector, text encoder and the visual encoder of VLMs. Finally, the proposed method further promotes the adaptation ability by adopting a prompt-based memory bank to generate semantic-complete features for graph matching. These contributions enable VLDadaptor to extract visual features into the visual-language embedding space without any evident model bias towards specific domains. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on Pascal VOC to Clipart adaptation tasks and exhibits high accuracy on driving scenario tasks with significantly less training time.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11316-11331"},"PeriodicalIF":8.4000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10669066/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Domain adaptive object detection (DAOD) aims to develop a detector trained on labeled source domains that can identify objects in unlabeled target domains. A primary challenge in DAOD is the domain shift problem. Most existing methods learn domain-invariant features within a single domain embedding space, which often results in heavy model biases caused by the intrinsic data properties of the source domains. To mitigate these biases, this paper proposes VLDadaptor, a domain adaptive object detector based on vision-language model (VLM) distillation. First, the proposed method integrates domain-mixed contrastive knowledge distillation between the visual encoder of CLIP and the detector by transferring category-level instance features, which ensures that the detector can extract domain-invariant visual instance features across domains. Second, VLDadaptor employs domain-mixed consistency distillation between the text encoder of CLIP and the detector by aligning text prompt embeddings with visual instance features, which helps maintain category-level feature consistency among the detector and the text and visual encoders of the VLM. Finally, the proposed method further promotes adaptation by adopting a prompt-based memory bank to generate semantically complete features for graph matching. These contributions enable VLDadaptor to embed visual features into the vision-language embedding space without any evident model bias toward specific domains. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on Pascal VOC to Clipart adaptation tasks and exhibits high accuracy on driving-scenario tasks with significantly less training time.
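The abstract's first component, domain-mixed contrastive knowledge distillation of category-level instance features, can be illustrated with a minimal sketch. Below, a supervised-contrastive loss pulls detector instance embeddings toward CLIP visual-encoder embeddings of the same category over a batch that mixes source and target instances. The function name, tensor shapes, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def domain_mixed_contrastive_kd(student_feats, teacher_feats, labels, tau=0.07):
    """Supervised-contrastive distillation over a domain-mixed batch (a sketch).

    student_feats: (N, D) detector ROI features projected to CLIP's dimension
    teacher_feats: (N, D) CLIP visual-encoder embeddings of the same instances
    labels:        (N,)  category ids; N mixes source and target instances
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    logits = s @ t.T / tau                              # (N, N) similarity matrix
    # Positives are all teacher embeddings sharing the student's category,
    # regardless of which domain the instance came from.
    pos = (labels[:, None] == labels[None, :]).float()
    log_prob = logits.log_softmax(dim=-1)
    loss = -(pos * log_prob).sum(-1) / pos.sum(-1).clamp(min=1)
    return loss.mean()

# Demo: 8 mixed source/target instances, 512-d features, 5 categories
s, t = torch.randn(8, 512), torch.randn(8, 512)
y = torch.randint(0, 5, (8,))
print(domain_mixed_contrastive_kd(s, t, y))
```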
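The second component, domain-mixed consistency distillation against CLIP text-prompt embeddings, might look like the following sketch: the prompt embeddings serve as a fixed zero-shot classifier, and a KL term keeps the detector's category distribution consistent with that of the CLIP visual encoder. The prompt template mentioned in the docstring, the temperature, and the use of KL divergence are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def consistency_distillation(student_feats, teacher_feats, text_emb, tau=0.01):
    """KL consistency between student and teacher prompt-similarity distributions.

    student_feats: (N, D) detector instance features projected to CLIP's dimension
    teacher_feats: (N, D) CLIP visual-encoder embeddings of the same instances
    text_emb:      (C, D) normalized CLIP text embeddings of category prompts
                   (in practice, encode_text on prompts such as "a photo of a {class}")
    """
    s_logits = F.normalize(student_feats, dim=-1) @ text_emb.T / tau   # (N, C)
    t_logits = F.normalize(teacher_feats, dim=-1) @ text_emb.T / tau
    return F.kl_div(s_logits.log_softmax(-1), t_logits.softmax(-1),
                    reduction="batchmean")

# Demo with random stand-ins for the encoders' outputs
text_emb = F.normalize(torch.randn(5, 512), dim=-1)   # 5 category prompts
s, t = torch.randn(8, 512), torch.randn(8, 512)
print(consistency_distillation(s, t, text_emb))
```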
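Finally, the prompt-based memory bank that supplies semantically complete features for graph matching could be sketched as a per-category prototype store: seeded with text-prompt embeddings, updated by an exponential moving average from instance features, and queried to fill in categories missing from a batch. The class name, momentum value, and EMA update rule are assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

class PromptMemoryBank:
    """Per-category prototype bank seeded by CLIP text-prompt embeddings (a sketch)."""

    def __init__(self, text_emb, momentum=0.99):
        self.protos = F.normalize(text_emb.clone(), dim=-1)   # (C, D) prototypes
        self.m = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        # EMA-update each observed category's prototype with the batch mean.
        feats = F.normalize(feats, dim=-1)
        for c in labels.unique():
            mean = feats[labels == c].mean(dim=0)
            self.protos[c] = F.normalize(
                self.m * self.protos[c] + (1 - self.m) * mean, dim=0)

    def complete(self, feats, labels):
        # Return the batch features plus stored prototypes for categories
        # absent from this batch, so every category is represented.
        missing = [c for c in range(len(self.protos)) if (labels == c).sum() == 0]
        if not missing:
            return feats, labels
        extra_lbl = torch.tensor(missing, device=labels.device)
        return torch.cat([feats, self.protos[missing]]), torch.cat([labels, extra_lbl])

# Demo: classes 3 and 4 are missing from the batch and get filled from the bank
bank = PromptMemoryBank(F.normalize(torch.randn(5, 512), dim=-1))
feats, labels = torch.randn(6, 512), torch.tensor([0, 0, 1, 1, 2, 2])
bank.update(feats, labels)
full_feats, full_labels = bank.complete(feats, labels)
print(full_feats.shape, full_labels.tolist())
```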
Source Journal
IEEE Transactions on Multimedia (Engineering & Technology, Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Annual publications: 576
Review time: 5.5 months
Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of the sponsors, ensuring comprehensive coverage of multimedia research.