Zero-shot Video Classification with Appropriate Web and Task Knowledge Transfer
Junbao Zhuo, Yan Zhu, Shuhao Cui, Shuhui Wang, A. BinM., Qingming Huang, Xiaoming Wei, Xiaolin Wei
Proceedings of the 30th ACM International Conference on Multimedia, 2022-10-10. DOI: 10.1145/3503161.3548008
Citations: 5
Abstract
Zero-shot video classification (ZSVC), which aims to recognize video classes that have never been seen during model training, has become a thriving research direction. ZSVC is achieved by building mappings between visual and semantic embeddings. Recently, ZSVC has been approached by automatically mining the underlying objects in videos as attributes and incorporating external commonsense knowledge. However, objects mined from seen categories cannot generalize to unseen ones. Besides, the category-object relationships are usually extracted from commonsense knowledge or word embeddings, which are not consistent with the video modality. To tackle these issues, we propose to mine associated objects and category-object relationships for each category from retrieved web images. The associated objects of all categories are employed as generic attributes, and the mined category-object relationships narrow the modality inconsistency for better knowledge transfer. Another issue of existing ZSVC methods is that a model sufficiently trained on labeled seen categories may not generalize well to distinct unseen categories. To encourage more reliable transfer, we propose Task Similarity aware Representation Learning (TSRL). In TSRL, the similarity between the seen categories and the unseen ones is estimated and used to regularize the model in an appropriate way. We construct a model for ZSVC based on the constructed attributes, the mined category-object relationships, and the proposed TSRL. Experimental results on four public datasets, i.e., FCVID, UCF101, HMDB51 and Olympic Sports, show that our model performs favorably against state-of-the-art methods. Our code is publicly available at https://github.com/junbaoZHUO/TSRL.
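The abstract does not specify how the seen-unseen similarity in TSRL is computed or applied; as a minimal illustrative sketch (not the authors' implementation), one could score each seen class by its maximum cosine similarity to any unseen class in a shared semantic embedding space, then use those scores to reweight the seen-class training loss so that transfer-relevant classes dominate. The function names and the max/reweighting choices below are assumptions for illustration only:

```python
import numpy as np

def task_similarity_weights(seen_emb, unseen_emb):
    """Hypothetical sketch: score each seen class by its relevance to the
    unseen task, as one plausible reading of TSRL's similarity estimate.

    seen_emb:   (S, d) semantic embeddings of the seen classes
    unseen_emb: (U, d) semantic embeddings of the unseen classes
    Returns a (S,) weight vector: each seen class's maximum cosine
    similarity to any unseen class, rescaled so the mean weight is 1.
    """
    seen = seen_emb / np.linalg.norm(seen_emb, axis=1, keepdims=True)
    unseen = unseen_emb / np.linalg.norm(unseen_emb, axis=1, keepdims=True)
    sim = seen @ unseen.T                # (S, U) cosine similarities
    w = sim.max(axis=1)                  # best match among unseen classes
    return w * len(w) / w.sum()          # normalize: mean weight == 1

def similarity_weighted_ce(logits, labels, weights):
    """Cross-entropy over seen classes where each sample's loss is scaled
    by the task-similarity weight of its ground-truth class (assumed form
    of the regularization, for illustration)."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return float((weights[labels] * per_sample).mean())
```

Under this reading, seen classes that resemble no unseen class contribute less to training, which is one simple way the estimated similarity could "regularize the model in an appropriate way."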