Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma
{"title":"Vision-Language Adaptive Clustering and Meta-Adaptation for Unsupervised Few-Shot Action Recognition","authors":"Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma","doi":"10.1109/TCSVT.2025.3558785","DOIUrl":null,"url":null,"abstract":"Unsupervised few-shot action recognition is a practical but challenging task, which adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, it cannot achieve satisfactory performance due to the low-quality pseudo-classes and episodes. Though vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvements may still be limited by using only the visual encoder in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of a pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. Textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES) based on a video-text ensemble distance metric is proposed to accurately estimate pseudo-classes, which constructs high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) is designed for adapting the pre-trained model to novel tasks by category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9246-9260"},"PeriodicalIF":11.1000,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10960322/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Unsupervised few-shot action recognition is a practical but challenging task that adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, existing methods cannot achieve satisfactory performance due to low-quality pseudo-classes and episodes. Although pre-trained vision-language models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvement may still be limited when only the visual encoder is used, in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of the pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. A textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES), based on a video-text ensemble distance metric, is proposed to accurately estimate pseudo-classes and thereby construct high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) adapts the pre-trained model to novel tasks through category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.
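The abstract describes MACES only at a high level; the full formulation appears in the paper itself. The snippet below is therefore a rough illustrative sketch of the general idea of a video-text ensemble distance for pseudo-class clustering, not the authors' actual algorithm. It assumes precomputed CLIP visual embeddings, text embeddings of automatically generated captions, a hypothetical weighting parameter alpha, and off-the-shelf average-linkage clustering, all of which are assumptions rather than details taken from the paper.

```python
# Illustrative sketch only: a weighted video-text ensemble distance used to
# group unlabeled videos into pseudo-classes. Embeddings, alpha, and the
# clustering method are assumptions, not the paper's exact formulation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def ensemble_distance(vis_emb, txt_emb, alpha=0.5):
    """Weighted combination of visual and textual pairwise cosine distances."""
    vis = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    d_vis = 1.0 - vis @ vis.T  # cosine distance in the visual modality
    d_txt = 1.0 - txt @ txt.T  # cosine distance in the textual modality
    return alpha * d_vis + (1.0 - alpha) * d_txt


def cluster_pseudo_classes(vis_emb, txt_emb, n_pseudo_classes, alpha=0.5):
    """Assign each unlabeled video to a pseudo-class via average-linkage clustering."""
    dist = ensemble_distance(vis_emb, txt_emb, alpha)
    dist = 0.5 * (dist + dist.T)   # enforce symmetry before condensing
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_pseudo_classes, criterion="maxclust") - 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vis_emb = rng.normal(size=(200, 512))  # stand-in for CLIP visual embeddings
    txt_emb = rng.normal(size=(200, 512))  # stand-in for generated-caption embeddings
    labels = cluster_pseudo_classes(vis_emb, txt_emb, n_pseudo_classes=20)
    print(labels[:10])
```

In this sketch the resulting pseudo-class labels would then be used to sample N-way K-shot episodes for episodic training; the paper's actual sampling and adaptation procedure (VLMA, bidirectional distillation, multimodal adaptive inference) is not reproduced here.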
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.