Attention Based End-to-End Network for Short Video Classification

2022 18th International Conference on Mobility, Sensing and Networking (MSN). Pub Date: 2022-12-01. DOI: 10.1109/MSN57253.2022.00084

Hui Zhu, Chao Zou, Zhenyu Wang, Kai Xu, Zihao Huang


Citations: 0

Abstract



Three-dimensional (3D) convolutional kernels have been shown to capture local features effectively across the spatiotemporal range of videos, leading to impressive results for a variety of models on video-related tasks. With the introduction of the Transformer and the rise of the self-attention mechanism, a growing number of self-attention models have recently been applied to video representation learning. However, each type of model has its own limitations: local perception in convolutional models and the self-attention operation in attention-based models. Inspired by the global context network (GCNet), we combine the advantages of 3D convolution and the self-attention mechanism to design a novel operator called the GC-Conv block. The block performs local feature extraction and global context modeling and joins the two by channel-level concatenation, similar to the dense connectivity pattern in DenseNet, while remaining lightweight. Furthermore, we apply the block in multiple layers of our proposed end-to-end network for short video classification, in which temporal dependency is captured by dilated convolutions and a bidirectional GRU for better representation. Finally, our model outperforms both state-of-the-art convolutional models and self-attention models on three human action recognition datasets with considerably fewer parameters, which demonstrates its effectiveness.
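The abstract does not give the internals of the GC-Conv block, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes GCNet-style attention pooling (a softmax over all positions that produces one shared context vector) and then appends that vector to each position's local features, which is one plausible reading of "channel-level concatenation". The function names, the use of a plain weight vector in place of a 1x1 convolution, and the toy inputs are all assumptions made for illustration; the local features stand in for the output of a 3D convolution over flattened T x H x W positions.

```python
import math


def global_context_pool(features, key_weights):
    """Pool N position vectors into one context vector with softmax attention.

    features:    list of N positions, each a list of C channel values
                 (assumed to be flattened T*H*W locations).
    key_weights: list of C weights, a stand-in (assumption) for the 1x1
                 convolution that yields one attention logit per position.
    """
    # One scalar logit per position.
    logits = [sum(w * x for w, x in zip(key_weights, pos)) for pos in features]
    # Numerically stable softmax over positions.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Attention-weighted sum of the position vectors, channel by channel.
    channels = len(features[0])
    return [sum(a * pos[c] for a, pos in zip(attn, features))
            for c in range(channels)]


def concat_local_and_global(local_features, context):
    """Append the shared context vector to every position's local features."""
    return [pos + context for pos in local_features]


# Toy input: 3 positions with 2 channels each (hypothetical values).
local = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = global_context_pool(local, key_weights=[0.5, -0.5])
fused = concat_local_and_global(local, context)
print(len(fused), len(fused[0]))  # 3 4
```

Concatenation makes the channel count grow with each block, as in DenseNet, whereas the original GCNet block adds the transformed context back to every position. How the paper keeps this growth lightweight is not stated in the abstract.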

