Weakly Supervised Deep Reinforcement Learning for Video Summarization With Semantically Meaningful Reward

2021 IEEE Winter Conference on Applications of Computer Vision (WACV). Pub Date: 2021-01-01. DOI: 10.1109/WACV48630.2021.00328

Zu-Hua Li, Lei Yang


Citations: 13

Abstract

Conventional unsupervised video summarization algorithms are usually developed in a frame-level clustering manner. For example, frame-level diversity and representativeness are two typical clustering criteria used for unsupervised reinforcement learning-based video summarization. Inspired by recent progress in video representation techniques, we further introduce the similarity of video representations to construct a semantically meaningful reward for this task. We consider that a good summary should also be semantically identical to its original source, which means that semantic similarity can be regarded as an additional criterion for summarization. By combining a novel video semantic reward with other unsupervised rewards for training, we can easily upgrade an unsupervised reinforcement learning-based video summarization method to its weakly supervised version. In practice, we first train a video classification sub-network (VCSN) on a category-labeled video dataset to extract video semantic representations. We then fix this VCSN and train a summary generation sub-network (SGSN) on unlabeled video data using reinforcement learning. Experimental results demonstrate that our work significantly surpasses other unsupervised and even supervised methods. To the best of our knowledge, our method achieves state-of-the-art performance in terms of the correlation coefficients Kendall's τ and Spearman's ρ.
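The reward combination described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract gives no formulas, so the diversity and representativeness terms below follow a form commonly used in unsupervised reinforcement learning-based summarization, the semantic term is assumed to be a cosine similarity between the embeddings of the summary and of the full video, and `mean_pool_embed` is a hypothetical stand-in for the frozen VCSN. The function names and the weight `w_sem` are likewise assumptions.

```python
import numpy as np


def diversity_reward(feats, picks):
    """Mean pairwise cosine dissimilarity among the selected frames."""
    if len(picks) < 2:
        return 0.0
    x = feats[picks]
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    off_diagonal = ~np.eye(len(picks), dtype=bool)  # skip self-similarity
    return float((1.0 - sim)[off_diagonal].mean())


def representativeness_reward(feats, picks):
    """exp(-mean distance from every frame to its nearest selected frame)."""
    if len(picks) == 0:
        return 0.0
    dist = np.linalg.norm(feats[:, None, :] - feats[picks][None, :, :], axis=2)
    return float(np.exp(-dist.min(axis=1).mean()))


def mean_pool_embed(frames):
    """Hypothetical stand-in for the frozen VCSN: average the frame features."""
    return frames.mean(axis=0)


def semantic_reward(feats, picks, embed=mean_pool_embed):
    """Assumed form: cosine similarity of summary and full-video embeddings."""
    if len(picks) == 0:
        return 0.0
    a, b = embed(feats[picks]), embed(feats)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def total_reward(feats, picks, embed=mean_pool_embed, w_sem=1.0):
    """Unsupervised rewards plus the weighted semantic reward (assumed sum)."""
    picks = np.asarray(picks, dtype=int)
    return (
        diversity_reward(feats, picks)
        + representativeness_reward(feats, picks)
        + w_sem * semantic_reward(feats, picks, embed)
    )


# Example: 20 frames with 8-dimensional features, 3 frames selected.
rng = np.random.default_rng(0)
feats = rng.random((20, 8))
reward = total_reward(feats, [0, 5, 10])
```

In a policy-gradient setup, a scalar like `reward` would score each sampled frame selection while the SGSN is trained; only the embedding function depends on category labels, which is what makes the scheme weakly supervised.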

