Exploring Fine-Grained Visual-Text Feature Alignment With Prompt Tuning for Domain-Adaptive Object Detection

IF 9.4 · CAS Tier 1 (Computer Science) · JCR Q1 (AUTOMATION & CONTROL SYSTEMS)
Zhitao Wen;Jinhai Liu;Huaguang Zhang;Fengyuan Zuo
{"title":"探索细粒度的视觉文本特征对齐与提示调整领域自适应对象检测","authors":"Zhitao Wen;Jinhai Liu;Huaguang Zhang;Fengyuan Zuo","doi":"10.1109/TCYB.2025.3567126","DOIUrl":null,"url":null,"abstract":"Domain-adaptive object detection (DAOD) aims to generalize detectors trained in labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools to enhance the generalizability of detectors. However, there exist paradigm discrepancies between single-domain detection in most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, some preliminary solutions to these discrepancies may potentially neglect relational reasoning in prompts and cross-modal information interactions, which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and organizes a novel framework termed FGPro that contains three elaborated levels. First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. At the model level, a bidirectional cross-modal attention is structured to fully interact visual and textual fine-grained information. In addition, we customize a prompt-guided cross-domain regularization strategy to inject domain-invariant and domain-specific information into prompts in a disentangled manner. The three designs effectively align the fine-grained visual-text features of the source-target domain to facilitate the capture of domain-aware information. Experiments on four cross-domain scenarios show that FGPro exhibits notable performance improvements over existing work (Cross-weather: +1.0% AP50; Simulation-to-real: +1.2% AP50; Cross-camera: +1.3% AP50; Industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.","PeriodicalId":13112,"journal":{"name":"IEEE Transactions on Cybernetics","volume":"55 7","pages":"3220-3233"},"PeriodicalIF":9.4000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring Fine-Grained Visual-Text Feature Alignment With Prompt Tuning for Domain-Adaptive Object Detection\",\"authors\":\"Zhitao Wen;Jinhai Liu;Huaguang Zhang;Fengyuan Zuo\",\"doi\":\"10.1109/TCYB.2025.3567126\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Domain-adaptive object detection (DAOD) aims to generalize detectors trained in labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools to enhance the generalizability of detectors. However, there exist paradigm discrepancies between single-domain detection in most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, some preliminary solutions to these discrepancies may potentially neglect relational reasoning in prompts and cross-modal information interactions, which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and organizes a novel framework termed FGPro that contains three elaborated levels. 
First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. At the model level, a bidirectional cross-modal attention is structured to fully interact visual and textual fine-grained information. In addition, we customize a prompt-guided cross-domain regularization strategy to inject domain-invariant and domain-specific information into prompts in a disentangled manner. The three designs effectively align the fine-grained visual-text features of the source-target domain to facilitate the capture of domain-aware information. Experiments on four cross-domain scenarios show that FGPro exhibits notable performance improvements over existing work (Cross-weather: +1.0% AP50; Simulation-to-real: +1.2% AP50; Cross-camera: +1.3% AP50; Industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.\",\"PeriodicalId\":13112,\"journal\":{\"name\":\"IEEE Transactions on Cybernetics\",\"volume\":\"55 7\",\"pages\":\"3220-3233\"},\"PeriodicalIF\":9.4000,\"publicationDate\":\"2025-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Cybernetics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11007146/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cybernetics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11007146/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Exploring Fine-Grained Visual-Text Feature Alignment With Prompt Tuning for Domain-Adaptive Object Detection
Domain-adaptive object detection (DAOD) aims to generalize detectors trained in labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools to enhance the generalizability of detectors. However, there exist paradigm discrepancies between single-domain detection in most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, some preliminary solutions to these discrepancies may potentially neglect relational reasoning in prompts and cross-modal information interactions, which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and organizes a novel framework termed FGPro that contains three elaborated levels. First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. At the model level, a bidirectional cross-modal attention is structured to fully interact visual and textual fine-grained information. In addition, we customize a prompt-guided cross-domain regularization strategy to inject domain-invariant and domain-specific information into prompts in a disentangled manner. The three designs effectively align the fine-grained visual-text features of the source-target domain to facilitate the capture of domain-aware information. Experiments on four cross-domain scenarios show that FGPro exhibits notable performance improvements over existing work (Cross-weather: +1.0% AP50; Simulation-to-real: +1.2% AP50; Cross-camera: +1.3% AP50; Industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.
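The abstract names three concrete mechanisms: learnable domain-adaptive prompt tokens, a prompt relation encoder that infers intertoken semantic relations, and bidirectional cross-modal attention between visual and textual features. The paper's implementation is not reproduced on this page, so the following is only a minimal PyTorch sketch of how such a pipeline could be wired; the module names, dimensions, and single-block design are illustrative assumptions, not the authors' FGPro code.

```python
# Illustrative sketch only -- FGPro's actual implementation is not public in
# this abstract; all module names and dimensions here are assumptions.
import torch
import torch.nn as nn

class PromptRelationEncoder(nn.Module):
    """Self-attention over learnable prompt tokens, modeling intertoken
    semantic relations (the abstract's 'prompt relation encoder')."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prompts: torch.Tensor) -> torch.Tensor:
        # prompts: (batch, num_tokens, dim)
        out, _ = self.attn(prompts, prompts, prompts)
        return self.norm(prompts + out)  # residual + normalization

class BidirectionalCrossModalAttention(nn.Module):
    """Two cross-attention passes: text tokens attend to visual features,
    and visual features attend to text tokens, so fine-grained information
    flows in both directions."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # vis: (batch, num_regions, dim); txt: (batch, num_tokens, dim)
        txt_upd, _ = self.t2v(txt, vis, vis)  # text queries visual keys/values
        vis_upd, _ = self.v2t(vis, txt, txt)  # visual queries text keys/values
        return vis + vis_upd, txt + txt_upd

# Learnable domain-adaptive prompt: a small bank of free token embeddings.
dim, num_tokens = 512, 16
prompt_tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)

relation_enc = PromptRelationEncoder(dim)
cross_attn = BidirectionalCrossModalAttention(dim)

vis_feats = torch.randn(2, 100, dim)  # e.g., 100 detector region features
txt_feats = relation_enc(prompt_tokens.expand(2, -1, -1))
vis_out, txt_out = cross_attn(vis_feats, txt_feats)
print(vis_out.shape, txt_out.shape)  # (2, 100, 512) and (2, 16, 512)
```

The abstract's third component, the prompt-guided cross-domain regularization that injects domain-invariant and domain-specific information into the prompts in a disentangled manner, would act as a training-time loss on these prompt tokens and is omitted from this sketch.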
Source journal
IEEE Transactions on Cybernetics
Categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, CYBERNETICS
CiteScore: 25.40
Self-citation rate: 11.00%
Annual publications: 1869
Journal scope: The scope of the IEEE Transactions on Cybernetics includes computational approaches to the field of cybernetics. Specifically, the transactions welcome papers on communication and control across machines, or between machines, humans, and organizations. The scope includes such areas as computational intelligence, computer vision, neural networks, genetic algorithms, machine learning, fuzzy systems, cognitive systems, decision making, and robotics, to the extent that they contribute to the theme of cybernetics or demonstrate an application of cybernetics principles.