{"title":"短视频中动作识别的动作转换器","authors":"Yumeng Cai, Guoyong Cai, Jin Cai","doi":"10.1109/ICICIP53388.2021.9642184","DOIUrl":null,"url":null,"abstract":"Action recognition methods are mostly based on a 3-Dimensional (3D) Convolution Network which have some limitations in practice, e.g. redundant parameters, big memory consumed and low performance. In this paper, a new convolution-free model called action-transformer is proposed to address the mentioned problems. The model proposed is mainly composed of three modules: spatial-temporal transformation module, hybrid feature attention module, and residual-transformer module. The spatial-temporal transformation module is designed to map the split short video into spatial and temporal features. The hybrid feature attention module is designed to extract the fine-grained features from the spatial and temporal features and produce the hybrid features. The residual-transformer module is designed with the combination of the attention, feed-forward network, and the residual mechanism to extract local and global features from the hybrid features. The model is tested on the HMDB51 and UCFIOI data set, and the result shows that the memory, the parameters used by the proposed model are less than those models mentioned in the literature, and it achieves better performance too.","PeriodicalId":435799,"journal":{"name":"2021 11th International Conference on Intelligent Control and Information Processing (ICICIP)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Action-Transformer for Action Recognition in Short Videos\",\"authors\":\"Yumeng Cai, Guoyong Cai, Jin Cai\",\"doi\":\"10.1109/ICICIP53388.2021.9642184\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Action recognition methods are mostly based on a 3-Dimensional (3D) Convolution Network which have some limitations in practice, e.g. redundant parameters, big memory consumed and low performance. In this paper, a new convolution-free model called action-transformer is proposed to address the mentioned problems. The model proposed is mainly composed of three modules: spatial-temporal transformation module, hybrid feature attention module, and residual-transformer module. The spatial-temporal transformation module is designed to map the split short video into spatial and temporal features. The hybrid feature attention module is designed to extract the fine-grained features from the spatial and temporal features and produce the hybrid features. The residual-transformer module is designed with the combination of the attention, feed-forward network, and the residual mechanism to extract local and global features from the hybrid features. 
The model is tested on the HMDB51 and UCFIOI data set, and the result shows that the memory, the parameters used by the proposed model are less than those models mentioned in the literature, and it achieves better performance too.\",\"PeriodicalId\":435799,\"journal\":{\"name\":\"2021 11th International Conference on Intelligent Control and Information Processing (ICICIP)\",\"volume\":\"52 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 11th International Conference on Intelligent Control and Information Processing (ICICIP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICICIP53388.2021.9642184\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 11th International Conference on Intelligent Control and Information Processing (ICICIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICICIP53388.2021.9642184","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Action-Transformer for Action Recognition in Short Videos
Action recognition methods are mostly based on 3-Dimensional (3D) convolutional networks, which have practical limitations such as redundant parameters, high memory consumption, and low performance. In this paper, a new convolution-free model called the action-transformer is proposed to address these problems. The proposed model is composed of three modules: a spatial-temporal transformation module, a hybrid feature attention module, and a residual-transformer module. The spatial-temporal transformation module maps the split short video into spatial and temporal features. The hybrid feature attention module extracts fine-grained features from the spatial and temporal features and produces hybrid features. The residual-transformer module combines attention, a feed-forward network, and a residual mechanism to extract local and global features from the hybrid features. The model is evaluated on the HMDB51 and UCF101 datasets, and the results show that it uses less memory and fewer parameters than the models reported in the literature while also achieving better performance.
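To make the residual-transformer module concrete, the following is a minimal PyTorch sketch of a block that combines attention, a feed-forward network, and residual connections, as the abstract describes. The class name, dimensions, and pre-norm layout are assumptions for illustration, not details taken from the paper.

```python
# Sketch of a transformer block with attention, a feed-forward network,
# and residual connections, in the spirit of the residual-transformer
# module described in the abstract. All hyperparameters are assumed.
import torch
import torch.nn as nn

class ResidualTransformerBlock(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 512, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around self-attention (captures global context).
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Residual connection around the feed-forward network (local features).
        x = x + self.ffn(self.norm2(x))
        return x

# Example: a batch of 2 clips, each as 16 hybrid-feature tokens of width 512.
tokens = torch.randn(2, 16, 512)
out = ResidualTransformerBlock()(tokens)
print(out.shape)  # torch.Size([2, 16, 512])
```

Such a block is convolution-free: all spatial-temporal mixing happens through attention, which is consistent with the parameter and memory savings the abstract claims over 3D convolutional networks.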