Zero-shot Video Classification with Appropriate Web and Task Knowledge Transfer

Junbao Zhuo, Yan Zhu, Shuhao Cui, Shuhui Wang, A. BinM., Qingming Huang, Xiaoming Wei, Xiaolin Wei
{"title":"Zero-shot Video Classification with Appropriate Web and Task Knowledge Transfer","authors":"Junbao Zhuo, Yan Zhu, Shuhao Cui, Shuhui Wang, A. BinM., Qingming Huang, Xiaoming Wei, Xiaolin Wei","doi":"10.1145/3503161.3548008","DOIUrl":null,"url":null,"abstract":"Zero-shot video classification (ZSVC) that aims to recognize video classes that have never been seen during model training, has become a thriving research direction. ZSVC is achieved by building mappings between visual and semantic embeddings. Recently, ZSVC has been achieved by automatically mining the underlying objects in videos as attributes and incorporating external commonsense knowledge. However, the object mined from seen categories can not generalized to unseen ones. Besides, the category-object relationships are usually extracted from commonsense knowledge or word embedding, which is not consistent with video modality. To tackle these issues, we propose to mine associated objects and category-object relationships for each category from retrieved web images. The associated objects of all categories are employed as generic attributes and the mined category-object relationships could narrow the modality inconsistency for better knowledge transfer. Another issue of existing ZSVC methods is that the model sufficiently trained with labeled seen categories may not generalize well to distinct unseen categories. To encourage a more reliable transfer, we propose Task Similarity aware Representation Learning (TSRL). In TSRL, the similarity between seen categories and the unseen ones is estimated and used to regularize the model in an appropriate way. We construct a model for ZSVC based on the constructed attributes, the mined category-object relationships and the proposed TSRL. Experimental results on four public datasets, i.e., FCVID, UCF101, HMDB51 and Olympic Sports, show that our model performs favorably against state-of-the-art methods. Our codes are publicly available at https://github.com/junbaoZHUO/TSRL.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3503161.3548008","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Zero-shot video classification (ZSVC), which aims to recognize video classes that were never seen during model training, has become a thriving research direction. ZSVC is achieved by building mappings between visual and semantic embeddings. Recently, ZSVC has been addressed by automatically mining the underlying objects in videos as attributes and incorporating external commonsense knowledge. However, objects mined from seen categories do not generalize to unseen ones. Besides, category-object relationships are usually extracted from commonsense knowledge or word embeddings, which are not consistent with the video modality. To tackle these issues, we propose to mine associated objects and category-object relationships for each category from retrieved web images. The associated objects of all categories are employed as generic attributes, and the mined category-object relationships narrow the modality inconsistency for better knowledge transfer. Another issue of existing ZSVC methods is that a model sufficiently trained on labeled seen categories may not generalize well to distinct unseen categories. To encourage a more reliable transfer, we propose Task Similarity aware Representation Learning (TSRL), in which the similarity between the seen categories and the unseen ones is estimated and used to regularize the model in an appropriate way. We construct a model for ZSVC based on the constructed attributes, the mined category-object relationships, and the proposed TSRL. Experimental results on four public datasets, i.e., FCVID, UCF101, HMDB51, and Olympic Sports, show that our model performs favorably against state-of-the-art methods. Our code is publicly available at https://github.com/junbaoZHUO/TSRL.
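To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) scoring categories by treating mined objects as attributes routed through category-object relationships, and (b) a task-similarity-weighted regularizer in the spirit of TSRL. All function names, the cosine-similarity estimate of task similarity, and the weighting scheme are illustrative assumptions inferred from the abstract, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of attribute-based ZSVC scoring and a
# task-similarity-weighted regularizer; names and formulas are
# assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def attribute_scores(video_attr, rel_matrix):
    # video_attr: (B, A) per-video activations of mined object attributes
    # rel_matrix: (C, A) mined category-object relationship weights
    # returns:    (B, C) cosine compatibility of each video with each category
    return F.normalize(video_attr, dim=1) @ F.normalize(rel_matrix, dim=1).T

def task_similarity(seen_emb, unseen_emb):
    # Match each unseen category to its closest seen category by cosine
    # similarity and average; a scalar proxy for seen/unseen task overlap.
    sim = F.normalize(unseen_emb, dim=1) @ F.normalize(seen_emb, dim=1).T
    return sim.max(dim=1).values.mean()

def tsrl_objective(cls_loss, transfer_reg, seen_emb, unseen_emb):
    # Weight the transfer regularizer by the estimated task similarity:
    # knowledge from seen categories is trusted more when the unseen task
    # is close to the seen one (one plausible weighting scheme).
    w = task_similarity(seen_emb, unseen_emb).clamp(min=0.0)
    return cls_loss + w * transfer_reg

# Toy usage with random tensors:
scores = attribute_scores(torch.rand(4, 300), torch.rand(10, 300))  # (4, 10)
loss = tsrl_objective(torch.tensor(1.2), torch.tensor(0.5),
                      torch.rand(10, 512), torch.rand(5, 512))
```

The clamp keeps the weight nonnegative, so a dissimilar unseen task can only shrink the regularizer's influence rather than flip its sign; the paper's actual similarity estimate and regularization form may differ.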