{"title":"探索细粒度的视觉文本特征对齐与提示调整领域自适应对象检测","authors":"Zhitao Wen;Jinhai Liu;Huaguang Zhang;Fengyuan Zuo","doi":"10.1109/TCYB.2025.3567126","DOIUrl":null,"url":null,"abstract":"Domain-adaptive object detection (DAOD) aims to generalize detectors trained in labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools to enhance the generalizability of detectors. However, there exist paradigm discrepancies between single-domain detection in most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, some preliminary solutions to these discrepancies may potentially neglect relational reasoning in prompts and cross-modal information interactions, which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and organizes a novel framework termed FGPro that contains three elaborated levels. First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. At the model level, a bidirectional cross-modal attention is structured to fully interact visual and textual fine-grained information. In addition, we customize a prompt-guided cross-domain regularization strategy to inject domain-invariant and domain-specific information into prompts in a disentangled manner. The three designs effectively align the fine-grained visual-text features of the source-target domain to facilitate the capture of domain-aware information. 
Experiments on four cross-domain scenarios show that FGPro exhibits notable performance improvements over existing work (Cross-weather: +1.0% AP50; Simulation-to-real: +1.2% AP50; Cross-camera: +1.3% AP50; Industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.","PeriodicalId":13112,"journal":{"name":"IEEE Transactions on Cybernetics","volume":"55 7","pages":"3220-3233"},"PeriodicalIF":9.4000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring Fine-Grained Visual-Text Feature Alignment With Prompt Tuning for Domain-Adaptive Object Detection\",\"authors\":\"Zhitao Wen;Jinhai Liu;Huaguang Zhang;Fengyuan Zuo\",\"doi\":\"10.1109/TCYB.2025.3567126\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Domain-adaptive object detection (DAOD) aims to generalize detectors trained in labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools to enhance the generalizability of detectors. However, there exist paradigm discrepancies between single-domain detection in most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, some preliminary solutions to these discrepancies may potentially neglect relational reasoning in prompts and cross-modal information interactions, which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and organizes a novel framework termed FGPro that contains three elaborated levels. First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. 
At the model level, a bidirectional cross-modal attention is structured to fully interact visual and textual fine-grained information. In addition, we customize a prompt-guided cross-domain regularization strategy to inject domain-invariant and domain-specific information into prompts in a disentangled manner. The three designs effectively align the fine-grained visual-text features of the source-target domain to facilitate the capture of domain-aware information. Experiments on four cross-domain scenarios show that FGPro exhibits notable performance improvements over existing work (Cross-weather: +1.0% AP50; Simulation-to-real: +1.2% AP50; Cross-camera: +1.3% AP50; Industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.\",\"PeriodicalId\":13112,\"journal\":{\"name\":\"IEEE Transactions on Cybernetics\",\"volume\":\"55 7\",\"pages\":\"3220-3233\"},\"PeriodicalIF\":9.4000,\"publicationDate\":\"2025-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Cybernetics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11007146/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cybernetics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11007146/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Exploring Fine-Grained Visual-Text Feature Alignment With Prompt Tuning for Domain-Adaptive Object Detection
Domain-adaptive object detection (DAOD) aims to generalize detectors trained on labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools for enhancing the generalizability of detectors. However, paradigm discrepancies exist between the single-domain detection setting of most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, preliminary solutions to these discrepancies tend to neglect relational reasoning within prompts and cross-modal information interactions, both of which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and proposes a novel framework, termed FGPro, that operates at three elaborated levels. First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. Second, at the model level, a bidirectional cross-modal attention is structured to enable full interaction between fine-grained visual and textual information. Third, a prompt-guided cross-domain regularization strategy is customized to inject domain-invariant and domain-specific information into prompts in a disentangled manner. These three designs effectively align the fine-grained visual-text features of the source and target domains, facilitating the capture of domain-aware information. Experiments on four cross-domain scenarios show that FGPro yields notable performance improvements over existing work (cross-weather: +1.0% AP50; simulation-to-real: +1.2% AP50; cross-camera: +1.3% AP50; industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.
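The bidirectional cross-modal attention described in the abstract can be sketched roughly as follows. This is a minimal single-head, scaled-dot-product illustration of the general idea (each modality's tokens attend to the other's, with residual fusion); the function name `cross_attend`, the dimensions, and the residual form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, key_value):
    """Single-head cross-attention: tokens from one modality (query)
    attend to tokens from the other modality (key_value).
    Hypothetical simplified form with keys == values."""
    d = query.shape[-1]
    attn = softmax(query @ key_value.T / np.sqrt(d))  # (Nq, Nk) attention weights
    return attn @ key_value                           # (Nq, d) attended features

# Toy fine-grained features: 5 visual region tokens and 7 text/prompt
# tokens, both embedded in a shared 16-dim space.
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 16))
text = rng.normal(size=(7, 16))

# Bidirectional interaction: text-enhanced visual features and
# visual-enhanced text features, each fused residually.
visual_out = visual + cross_attend(visual, text)
text_out = text + cross_attend(text, visual)
print(visual_out.shape, text_out.shape)  # (5, 16) (7, 16)
```

The key property is symmetry: information flows in both directions, so fine-grained visual regions are conditioned on prompt tokens and vice versa, rather than text merely serving as a fixed classifier head.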
Journal introduction:
The scope of the IEEE Transactions on Cybernetics includes computational approaches to the field of cybernetics. Specifically, the Transactions welcomes papers on communication and control across machines, or between machines, humans, and organizations. The scope includes such areas as computational intelligence, computer vision, neural networks, genetic algorithms, machine learning, fuzzy systems, cognitive systems, decision making, and robotics, to the extent that they contribute to the theme of cybernetics or demonstrate an application of cybernetics principles.