一种用于动作识别的关键体挖掘深度框架

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Pub Date : 2016-06-27 DOI:10.1109/CVPR.2016.219

Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, Y. Qiao

{"title":"一种用于动作识别的关键体挖掘深度框架","authors":"Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, Y. Qiao","doi":"10.1109/CVPR.2016.219","DOIUrl":null,"url":null,"abstract":"Recently, deep learning approaches have demonstrated remarkable progresses for action recognition in videos. Most existing deep frameworks equally treat every volume i.e. spatial-temporal video clip, and directly assign a video label to all volumes sampled from it. However, within a video, discriminative actions may occur sparsely in a few key volumes, and most other volumes are irrelevant to the labeled action category. Training with a large proportion of irrelevant volumes will hurt performance. To address this issue, we propose a key volume mining deep framework to identify key volumes and conduct classification simultaneously. Specifically, our framework is trained is optimized in an alternative way integrated to the forward and backward stages of Stochastic Gradient Descent (SGD). In the forward pass, our network mines key volumes for each action class. In the backward pass, it updates network parameters with the help of these mined key volumes. In addition, we propose \"Stochastic out\" to model key volumes from multi-modalities, and an effective yet simple \"unsupervised key volume proposal\" method for high quality volume sampling. Our experiments show that action recognition performance can be significantly improved by mining key volumes, and we achieve state-of-the-art performance on HMDB51 and UCF101 (93.1%).","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"34 1","pages":"1991-1999"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"230","resultStr":"{\"title\":\"A Key Volume Mining Deep Framework for Action Recognition\",\"authors\":\"Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, Y. Qiao\",\"doi\":\"10.1109/CVPR.2016.219\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, deep learning approaches have demonstrated remarkable progresses for action recognition in videos. Most existing deep frameworks equally treat every volume i.e. spatial-temporal video clip, and directly assign a video label to all volumes sampled from it. However, within a video, discriminative actions may occur sparsely in a few key volumes, and most other volumes are irrelevant to the labeled action category. Training with a large proportion of irrelevant volumes will hurt performance. To address this issue, we propose a key volume mining deep framework to identify key volumes and conduct classification simultaneously. Specifically, our framework is trained is optimized in an alternative way integrated to the forward and backward stages of Stochastic Gradient Descent (SGD). In the forward pass, our network mines key volumes for each action class. In the backward pass, it updates network parameters with the help of these mined key volumes. In addition, we propose \\\"Stochastic out\\\" to model key volumes from multi-modalities, and an effective yet simple \\\"unsupervised key volume proposal\\\" method for high quality volume sampling. Our experiments show that action recognition performance can be significantly improved by mining key volumes, and we achieve state-of-the-art performance on HMDB51 and UCF101 (93.1%).\",\"PeriodicalId\":6515,\"journal\":{\"name\":\"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)\",\"volume\":\"34 1\",\"pages\":\"1991-1999\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"230\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CVPR.2016.219\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR.2016.219","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 230

摘要

最近，深度学习方法在视频动作识别方面取得了显著进展。大多数现有的深度框架平等地对待每个卷，即时空视频剪辑，并直接为从中采样的所有卷分配视频标签。然而，在视频中，判别行为可能会稀疏地出现在几个关键卷中，而大多数其他卷与标记的动作类别无关。使用大量不相关的数据量进行训练将会影响成绩。为了解决这个问题，我们提出了一个密钥卷挖掘深度框架来识别密钥卷并同时进行分类。具体来说，我们的框架以一种与随机梯度下降(SGD)的前向和后向阶段相结合的替代方式进行训练和优化。在向前传递中，我们的网络为每个动作类挖掘关键卷。在反向传递中，它借助这些挖掘的密钥卷更新网络参数。此外，我们提出了一种基于多模态的“随机输出”方法来模拟关键卷，并提出了一种有效而简单的“无监督关键卷建议”方法来进行高质量的体积采样。我们的实验表明，通过挖掘密钥量可以显著提高动作识别性能，并且我们在HMDB51和UCF101上达到了最先进的性能(93.1%)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Key Volume Mining Deep Framework for Action Recognition

Recently, deep learning approaches have demonstrated remarkable progresses for action recognition in videos. Most existing deep frameworks equally treat every volume i.e. spatial-temporal video clip, and directly assign a video label to all volumes sampled from it. However, within a video, discriminative actions may occur sparsely in a few key volumes, and most other volumes are irrelevant to the labeled action category. Training with a large proportion of irrelevant volumes will hurt performance. To address this issue, we propose a key volume mining deep framework to identify key volumes and conduct classification simultaneously. Specifically, our framework is trained is optimized in an alternative way integrated to the forward and backward stages of Stochastic Gradient Descent (SGD). In the forward pass, our network mines key volumes for each action class. In the backward pass, it updates network parameters with the help of these mined key volumes. In addition, we propose "Stochastic out" to model key volumes from multi-modalities, and an effective yet simple "unsupervised key volume proposal" method for high quality volume sampling. Our experiments show that action recognition performance can be significantly improved by mining key volumes, and we achieve state-of-the-art performance on HMDB51 and UCF101 (93.1%).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

自引率

0.00%

发文量