{"title":"Test-time adaptation for object detection via Dynamic Dual Teaching","authors":"Siqi Zhang , Lu Zhang , Zhiyong Liu","doi":"10.1016/j.imavis.2025.105740","DOIUrl":null,"url":null,"abstract":"<div><div>Test-Time Adaptation (TTA) is a practical setting in real-world applications, which aims to adapt a source-trained model to target domains with online unlabeled test data streams. Current approaches often rely on self-training, utilizing supervision signals from the source-trained model, suffering from poor adaptation due to diverse domain shifts. In this paper, we propose a novel test-time adaptation method for object detection guided by dual teachers, termed <strong>D</strong>ynamic <strong>D</strong>ual <strong>T</strong>eaching (<strong>DDT</strong>). Inspired by the generalization potentials of Vision-Language Models (VLMs), we integrate the VLM as an additional language-driven instructor. This integration exploits the domain-robustness of language prompts to mitigate domain shifts, collaborating with the teacher of source information within the teacher–student framework. Firstly, we utilize an ensemble prompt to guide the prediction process of the language-driven instructor. Secondly, a dynamic fusion strategy of the dual teachers is designed to generate high-quality pseudo-labels for student learning. Moreover, we incorporate a dual prediction consistency regularization to further mitigate the sensitivity of the adapted detector to domain shifts. Experiments on diverse domain adaptation benchmarks demonstrate that the proposed DDT method achieves state-of-the-art performance on both online and offline domain adaptation settings.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105740"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003282","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Test-Time Adaptation (TTA) is a practical setting for real-world applications that aims to adapt a source-trained model to target domains using online streams of unlabeled test data. Current approaches often rely on self-training with supervision signals from the source-trained model, and therefore adapt poorly under diverse domain shifts. In this paper, we propose a novel test-time adaptation method for object detection guided by dual teachers, termed Dynamic Dual Teaching (DDT). Inspired by the generalization potential of Vision-Language Models (VLMs), we integrate a VLM as an additional language-driven instructor. This integration exploits the domain robustness of language prompts to mitigate domain shifts, collaborating with the source-information teacher within the teacher–student framework. First, we use an ensemble prompt to guide the prediction process of the language-driven instructor. Second, a dynamic fusion strategy over the dual teachers is designed to generate high-quality pseudo-labels for student learning. Moreover, we incorporate a dual prediction consistency regularization to further reduce the adapted detector's sensitivity to domain shifts. Experiments on diverse domain adaptation benchmarks demonstrate that the proposed DDT method achieves state-of-the-art performance in both online and offline domain adaptation settings.
Journal overview:
Image and Vision Computing has as its primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.