Video Captioning in Bengali With Visual Attention
Suvom Shaha, F. Shah, Amir Hossain Raj, Ashek Seum, Saiful Islam, Sifat Ahmed
2022 25th International Conference on Computer and Information Technology (ICCIT), 2022-12-17
DOI: 10.1109/ICCIT57492.2022.10055190 (https://doi.org/10.1109/ICCIT57492.2022.10055190)
Abstract
Generating automatic video captions is one of the most challenging Artificial Intelligence tasks, as it combines the Computer Vision and Natural Language Processing research areas. The task is even more difficult for a complex language like Bengali, because video captioning datasets in Bengali are scarce. To overcome this challenge, we introduce a fully human-annotated dataset of Bengali captions for the videos of the MSVD dataset. We propose a novel end-to-end architecture with an attention-based decoder to generate meaningful video captions in Bengali. First, spatial and temporal features of the videos are combined using Bidirectional Gated Recurrent Units (Bi-GRU) to produce the input feature, which is then fed to the attention layer along with embedded caption features. This attention mechanism captures the interdependence between visual and textual representations. A double-layered GRU then takes these combined attention features to generate meaningful sentences. We trained this model on our proposed dataset and achieved 39.35% in BLEU-4, 59.67% in CIDEr, and 65.34% in ROUGE. These results are state of the art among video captioning work available in the Bengali language.
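To make the described pipeline concrete, below is a minimal PyTorch sketch of the three components named in the abstract: a Bi-GRU encoder over pre-extracted frame features, an attention layer that relates the encoded video to the decoder state, and a two-layer GRU decoder. The feature dimensions, the additive (Bahdanau-style) attention formulation, and the assumption that spatial/temporal frame features are already extracted are illustrative choices, not the authors' exact implementation.

```python
# Sketch only: dimensions and the attention form are assumptions,
# not the published model's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiGRUVideoEncoder(nn.Module):
    """Combines per-frame (spatial/temporal) features over time with a Bi-GRU."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, frame_feats):           # (B, T, feat_dim)
        outputs, _ = self.bigru(frame_feats)  # (B, T, 2*hidden_dim)
        return outputs


class AdditiveAttention(nn.Module):
    """Scores each encoded frame against the current decoder state."""

    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, attn_dim)
        self.dec_proj = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, enc_outputs, dec_state):          # (B, T, E), (B, D)
        energy = torch.tanh(self.enc_proj(enc_outputs)
                            + self.dec_proj(dec_state).unsqueeze(1))
        weights = F.softmax(self.score(energy).squeeze(-1), dim=1)     # (B, T)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights


class CaptionDecoder(nn.Module):
    """Two-layer GRU that consumes [word embedding; attended visual context]."""

    def __init__(self, vocab_size, embed_dim=300, enc_dim=1024, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = AdditiveAttention(enc_dim, hidden_dim)
        self.gru = nn.GRU(embed_dim + enc_dim, hidden_dim,
                          num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, enc_outputs, hidden):
        emb = self.embed(prev_word)                        # (B, embed_dim)
        context, _ = self.attention(enc_outputs, hidden[-1])
        gru_in = torch.cat([emb, context], dim=-1).unsqueeze(1)
        output, hidden = self.gru(gru_in, hidden)          # (B, 1, hidden_dim)
        logits = self.out(output.squeeze(1))               # (B, vocab_size)
        return logits, hidden


# One decoding step on dummy data (vocabulary size and <bos> id are assumed).
encoder = BiGRUVideoEncoder()
decoder = CaptionDecoder(vocab_size=8000)
frames = torch.randn(4, 30, 2048)               # 4 videos, 30 frame features each
enc_out = encoder(frames)
hidden = torch.zeros(2, 4, 512)                 # initial state of the 2-layer GRU
prev = torch.full((4,), 1, dtype=torch.long)    # assumed <bos> token id
logits, hidden = decoder(prev, enc_out, hidden)
```

At training time, such a decoder would typically be unrolled over the reference Bengali caption with teacher forcing and a cross-entropy loss; at inference, greedy or beam search would feed each predicted word back in as `prev_word`.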