{"title":"Human Action Captioning based on a GRU+LSTM+Attention Model","authors":"Lijuan Zhou, Weicong Zhang, Xiaojie Qian","doi":"10.1145/3512576.3512606","DOIUrl":null,"url":null,"abstract":"To quickly understand human actions in the videos, this paper proposes to solve the human action captioning problem which aims to automatically generate text descriptions based on human action videos. A sequence-to-sequence method based on GRU+LSTM+Attention (GLA) model is proposed to solve this problem. Specifically, GRU is applied as the encoder to capture the temporal information of actions. The LSTM is applied as the decoder to generate the fluent fine-grained descriptions for human actions. To focus on the most relevant part of actions and capture the correlation between actions and descriptions, an attention mechanism is applied in the proposed method. Experiments on the WorkoutUOW-18 dataset demonstrate the effectiveness of the proposed method.","PeriodicalId":278114,"journal":{"name":"Proceedings of the 2021 9th International Conference on Information Technology: IoT and Smart City","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2021 9th International Conference on Information Technology: IoT and Smart City","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3512576.3512606","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
To quickly understand human actions in videos, this paper addresses the human action captioning problem, which aims to automatically generate text descriptions from human action videos. A sequence-to-sequence method based on a GRU+LSTM+Attention (GLA) model is proposed to solve this problem. Specifically, a GRU is applied as the encoder to capture the temporal information of actions, and an LSTM is applied as the decoder to generate fluent, fine-grained descriptions of human actions. To focus on the most relevant parts of actions and capture the correlation between actions and descriptions, an attention mechanism is incorporated into the proposed method. Experiments on the WorkoutUOW-18 dataset demonstrate the effectiveness of the proposed method.
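To make the described architecture concrete, below is a minimal PyTorch sketch of a GLA-style encoder-decoder: a GRU encodes the sequence of frame features, a dot-product attention step weights the encoder states at each decoding step, and an LSTM cell emits caption tokens. All layer sizes, the specific attention variant, and names such as `GLACaptioner` and `feat_dim` are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a GRU-encoder / LSTM-decoder captioner with attention.
# Dimensions and the dot-product attention are assumptions for illustration.
import torch
import torch.nn as nn


class GLACaptioner(nn.Module):
    def __init__(self, feat_dim, vocab_size, hidden_dim=512, embed_dim=256):
        super().__init__()
        # Encoder: a GRU summarizes the temporal sequence of frame features.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Decoder: an LSTM cell generates the caption one token at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, T, feat_dim) frame features; captions: (B, L) token ids.
        enc_out, h_enc = self.encoder(feats)       # enc_out: (B, T, H)
        h = h_enc.squeeze(0)                       # initial decoder state
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            # Dot-product attention over encoder states (one common choice).
            scores = torch.bmm(enc_out, h.unsqueeze(2)).squeeze(2)   # (B, T)
            attn = torch.softmax(scores, dim=1).unsqueeze(1)         # (B, 1, T)
            context = torch.bmm(attn, enc_out).squeeze(1)            # (B, H)
            # Condition the LSTM step on the current token and the context.
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.decoder(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)          # (B, L, vocab_size)
```

In a setup like this, training would typically use teacher forcing with a cross-entropy loss over the logits, while inference would feed each predicted token back into the decoder instead of the ground-truth caption.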