{"title":"探索细粒度的视觉文本特征对齐与提示调整领域自适应对象检测","authors":"Zhitao Wen;Jinhai Liu;Huaguang Zhang;Fengyuan Zuo","doi":"10.1109/TCYB.2025.3567126","DOIUrl":null,"url":null,"abstract":"Domain-adaptive object detection (DAOD) aims to generalize detectors trained in labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools to enhance the generalizability of detectors. However, there exist paradigm discrepancies between single-domain detection in most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, some preliminary solutions to these discrepancies may potentially neglect relational reasoning in prompts and cross-modal information interactions, which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and organizes a novel framework termed FGPro that contains three elaborated levels. First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. At the model level, a bidirectional cross-modal attention is structured to fully interact visual and textual fine-grained information. In addition, we customize a prompt-guided cross-domain regularization strategy to inject domain-invariant and domain-specific information into prompts in a disentangled manner. The three designs effectively align the fine-grained visual-text features of the source-target domain to facilitate the capture of domain-aware information. 
Experiments on four cross-domain scenarios show that FGPro exhibits notable performance improvements over existing work (Cross-weather: +1.0% AP50; Simulation-to-real: +1.2% AP50; Cross-camera: +1.3% AP50; Industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.","PeriodicalId":13112,"journal":{"name":"IEEE Transactions on Cybernetics","volume":"55 7","pages":"3220-3233"},"PeriodicalIF":9.4000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring Fine-Grained Visual-Text Feature Alignment With Prompt Tuning for Domain-Adaptive Object Detection\",\"authors\":\"Zhitao Wen;Jinhai Liu;Huaguang Zhang;Fengyuan Zuo\",\"doi\":\"10.1109/TCYB.2025.3567126\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Domain-adaptive object detection (DAOD) aims to generalize detectors trained in labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools to enhance the generalizability of detectors. However, there exist paradigm discrepancies between single-domain detection in most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, some preliminary solutions to these discrepancies may potentially neglect relational reasoning in prompts and cross-modal information interactions, which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and organizes a novel framework termed FGPro that contains three elaborated levels. First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. 
At the model level, a bidirectional cross-modal attention is structured to fully interact visual and textual fine-grained information. In addition, we customize a prompt-guided cross-domain regularization strategy to inject domain-invariant and domain-specific information into prompts in a disentangled manner. The three designs effectively align the fine-grained visual-text features of the source-target domain to facilitate the capture of domain-aware information. Experiments on four cross-domain scenarios show that FGPro exhibits notable performance improvements over existing work (Cross-weather: +1.0% AP50; Simulation-to-real: +1.2% AP50; Cross-camera: +1.3% AP50; Industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.\",\"PeriodicalId\":13112,\"journal\":{\"name\":\"IEEE Transactions on Cybernetics\",\"volume\":\"55 7\",\"pages\":\"3220-3233\"},\"PeriodicalIF\":9.4000,\"publicationDate\":\"2025-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Cybernetics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11007146/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cybernetics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11007146/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Exploring Fine-Grained Visual-Text Feature Alignment With Prompt Tuning for Domain-Adaptive Object Detection
Domain-adaptive object detection (DAOD) aims to generalize detectors trained on labeled source domains to unlabeled target domains by mitigating domain bias. Recent studies have confirmed that pretrained vision-language models (VLMs) are promising tools for enhancing the generalizability of detectors. However, paradigm discrepancies exist between the single-domain detection setting of most existing works and DAOD tasks, which may hinder the fine-grained alignment of cross-domain visual-text features. In addition, preliminary solutions to these discrepancies tend to neglect relational reasoning within prompts and cross-modal information interactions, both of which are crucial for fine-grained alignment. To this end, this article explores fine-grained visual-text feature alignment in DAOD with prompt tuning and proposes a novel framework, termed FGPro, that operates at three elaborated levels. First, at the prompt level, a learnable domain-adaptive prompt is organized and a prompt relation encoder is constructed to infer intertoken semantic relations in the prompt. Second, at the model level, a bidirectional cross-modal attention is structured to enable full interaction between fine-grained visual and textual information. Third, a prompt-guided cross-domain regularization strategy is customized to inject domain-invariant and domain-specific information into prompts in a disentangled manner. These three designs effectively align the fine-grained visual-text features of the source and target domains, facilitating the capture of domain-aware information. Experiments on four cross-domain scenarios show that FGPro yields notable performance improvements over existing work (cross-weather: +1.0% AP50; simulation-to-real: +1.2% AP50; cross-camera: +1.3% AP50; industry: +2.8% AP50), validating the effectiveness of its fine-grained alignment.
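The bidirectional cross-modal attention described in the abstract can be sketched roughly as follows. This is a minimal single-head, scaled-dot-product illustration of the general idea (each modality's tokens attend to the other's, with residual fusion); the function name `cross_attend`, the dimensions, and the residual form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, key_value):
    """Single-head cross-attention: tokens from one modality (query)
    attend to tokens from the other modality (key_value).
    Hypothetical simplified form with keys == values."""
    d = query.shape[-1]
    attn = softmax(query @ key_value.T / np.sqrt(d))  # (Nq, Nk) attention weights
    return attn @ key_value                           # (Nq, d) attended features

# Toy fine-grained features: 5 visual region tokens and 7 text/prompt
# tokens, both embedded in a shared 16-dim space.
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 16))
text = rng.normal(size=(7, 16))

# Bidirectional interaction: text-enhanced visual features and
# visual-enhanced text features, each fused residually.
visual_out = visual + cross_attend(visual, text)
text_out = text + cross_attend(text, visual)
print(visual_out.shape, text_out.shape)  # (5, 16) (7, 16)
```

The key property is symmetry: information flows in both directions, so fine-grained visual regions are conditioned on prompt tokens and vice versa, rather than text merely serving as a fixed classifier head.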
Journal introduction:
The scope of the IEEE Transactions on Cybernetics includes computational approaches to the field of cybernetics. Specifically, the Transactions welcomes papers on communication and control across machines, or between machines, humans, and organizations. The scope includes such areas as computational intelligence, computer vision, neural networks, genetic algorithms, machine learning, fuzzy systems, cognitive systems, decision making, and robotics, to the extent that they contribute to the theme of cybernetics or demonstrate an application of cybernetics principles.