Title: Image Caption Model Based on Multi-Head Attention and Encoder-Decoder Framework
Authors: Jianwei Luo, Li Ma
DOI: 10.1109/ISKE47853.2019.9170306
Published in: 2019 IEEE 14th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), November 2019
Citations: 1
Abstract
Recently, image captioning tasks have commonly been solved by using an LSTM to generate descriptions. However, such a model relies only on image features and struggles to learn syntactic features, which leads to inaccurate descriptions. In this paper, an image captioning model based on a multi-head attention mechanism is presented. Specifically, the proposed model adopts an Encoder-Decoder framework. A five-layer ResNet is used in the Encoder module to extract image features. A multi-head attention layer and a fully connected feed-forward layer are added to the Decoder module. In addition, to capture the order of the extracted feature sequences, positional encoding is incorporated when computing multi-head self-attention. Experimental results show that the proposed model outperforms other current models based on various visual attention mechanisms.
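The decoder described above combines multi-head self-attention with positional encoding over the extracted feature sequence. A minimal NumPy sketch of those two ingredients might look like the following; the weights are random placeholders (a trained model would learn them), and the sequence length and model width are chosen purely for illustration, not taken from the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding (Transformer-style); the paper uses
    # positional information as a factor when computing self-attention.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    # x: (seq_len, d_model). Projection weights are random stand-ins.
    seq_len, d_model = x.shape
    d_k = d_model // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    # Project, then split the model dimension across heads: (heads, seq, d_k).
    q = (x @ Wq).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (heads, seq, seq).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    out = softmax(scores) @ v
    # Concatenate heads and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

# Toy input: 6 image-feature "tokens" of width 16, plus positional encoding.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 16))
x = tokens + positional_encoding(6, 16)
y = multi_head_self_attention(x, num_heads=4, rng=rng)
print(y.shape)  # (6, 16)
```

Adding the positional encoding before attention is what lets the otherwise order-invariant attention mechanism distinguish the sequence order of the extracted features.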