Vision-Language Adaptive Clustering and Meta-Adaptation for Unsupervised Few-Shot Action Recognition

IF 11.1 · CAS Region 1 (Engineering & Technology) · JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC
Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma
{"title":"无监督少镜头动作识别的视觉语言自适应聚类和元适应","authors":"Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma","doi":"10.1109/TCSVT.2025.3558785","DOIUrl":null,"url":null,"abstract":"Unsupervised few-shot action recognition is a practical but challenging task, which adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, it cannot achieve satisfactory performance due to the low-quality pseudo-classes and episodes. Though vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvements may still be limited by using only the visual encoder in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of a pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. Textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES) based on a video-text ensemble distance metric is proposed to accurately estimate pseudo-classes, which constructs high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) is designed for adapting the pre-trained model to novel tasks by category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9246-9260"},"PeriodicalIF":11.1000,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Vision-Language Adaptive Clustering and Meta-Adaptation for Unsupervised Few-Shot Action Recognition\",\"authors\":\"Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma\",\"doi\":\"10.1109/TCSVT.2025.3558785\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Unsupervised few-shot action recognition is a practical but challenging task, which adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, it cannot achieve satisfactory performance due to the low-quality pseudo-classes and episodes. Though vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvements may still be limited by using only the visual encoder in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of a pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. Textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES) based on a video-text ensemble distance metric is proposed to accurately estimate pseudo-classes, which constructs high-quality few-shot tasks (episodes) for episodic training. 
Vision-language meta-adaptation (VLMA) is designed for adapting the pre-trained model to novel tasks by category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 9\",\"pages\":\"9246-9260\"},\"PeriodicalIF\":11.1000,\"publicationDate\":\"2025-04-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10960322/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10960322/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Unsupervised few-shot action recognition is a practical but challenging task, which adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, it cannot achieve satisfactory performance due to the low-quality pseudo-classes and episodes. Though vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvements may still be limited by using only the visual encoder in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of a pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. Textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES) based on a video-text ensemble distance metric is proposed to accurately estimate pseudo-classes, which constructs high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) is designed for adapting the pre-trained model to novel tasks by category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.
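The abstract does not spell out how the MACES clustering step is implemented, but the idea of a video-text ensemble distance for building pseudo-classes can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the fixed mixing weight alpha, the use of plain k-means, and the helper names ensemble_distance / pseudo_classes are not taken from the paper; the inputs stand in for CLIP video embeddings and text embeddings of captions produced by a video-to-text model.

# Minimal sketch (not the paper's implementation): an ensemble video-text
# distance for pseudo-class clustering, in the spirit of MACES.
# Assumptions: features are L2-normalized CLIP-style embeddings, the two
# modalities are mixed with a fixed weight `alpha`, and pseudo-classes are
# obtained with plain k-means.
import numpy as np
from sklearn.cluster import KMeans


def ensemble_distance(video_feats, text_feats, alpha=0.5):
    """Pairwise distance mixing visual and textual cosine distances."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    d_visual = 1.0 - v @ v.T   # cosine distance between video embeddings
    d_textual = 1.0 - t @ t.T  # cosine distance between caption embeddings
    return alpha * d_visual + (1.0 - alpha) * d_textual


def pseudo_classes(video_feats, text_feats, num_clusters, alpha=0.5, seed=0):
    """Cluster unlabeled videos into pseudo-classes with the ensemble distance.

    k-means operates on points rather than distance matrices, so the ensemble
    distance is embedded by concatenating the weighted, normalized modalities:
    for unit vectors, squared Euclidean distance on the concatenation is
    proportional to the weighted sum of the two cosine distances.
    """
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    joint = np.concatenate([np.sqrt(alpha) * v, np.sqrt(1.0 - alpha) * t], axis=1)
    return KMeans(n_clusters=num_clusters, random_state=seed, n_init=10).fit_predict(joint)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vid = rng.normal(size=(200, 512))  # stand-in for CLIP video embeddings
    txt = rng.normal(size=(200, 512))  # stand-in for caption text embeddings
    print(ensemble_distance(vid, txt).shape)       # (200, 200) pairwise distances
    print(pseudo_classes(vid, txt, num_clusters=10)[:20])  # pseudo-labels

From such pseudo-labels, N-way K-shot episodes would then be sampled for episodic training, as described in the abstract; the episode-construction and meta-adaptation details are specific to the paper and are not reproduced here.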
Source Journal
CiteScore: 13.80
Self-citation rate: 27.40%
Annual publication volume: 660
Review time: 5 months
Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.