Bidirectional Temporal Context Fusion with Bi-Modal Semantic Features using a gating mechanism for Dense Video Captioning

Noorhan Khaled, M. Aref, M. Marey
{"title":"Bidirectional Temporal Context Fusion with Bi-Modal Semantic Features using a gating mechanism for Dense Video Captioning","authors":"Noorhan Khaled, M. Aref, M. Marey","doi":"10.21608/IJICIS.2021.60216.1055","DOIUrl":null,"url":null,"abstract":"Dense video captioning involves detecting interesting events and generating textual descriptions for each event in an untrimmed video. Many machine intelligent applications such as video summarization, search and retrieval, automatic video subtitling for supporting blind disabled people, benefit from automated dense captions generator. Most recent works attempted to make use of an encoder-decoder neural network framework which employs a 3D-CNN as an encoder for representing a detected event frames, and an RNN as a decoder for caption generation. They follow an attention based mechanism to learn where to focus in the encoded video frames during caption generation. Although the attention-based approaches have achieved excellent results, they directly link visual features to textual captions and ignore the rich intermediate/high-level video concepts such as people, objects, scenes, and actions. In this paper, we firstly propose to obtain a better event representation that discriminates between events nearly ending at the same time by applying an attention based fusion. Where hidden states from a bi-directional LSTM sequence video encoder, which encodes past and future surrounding context information of a detected event are fused along with its visual (R3D) features. Secondly, we propose to explicitly extract bi-modal semantic concepts (nouns and verbs) from a detected event segment and equilibrate the contributions from the proposed event representation and the semantic concepts dynamically using a gating mechanism while captioning. Experimental results demonstrates that our proposed attention based fusion is better in representing an event for captioning. Also involving semantic concepts improves captioning performance.","PeriodicalId":244591,"journal":{"name":"International Journal of Intelligent Computing and Information Sciences","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Intelligent Computing and Information Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21608/IJICIS.2021.60216.1055","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Dense video captioning involves detecting interesting events in an untrimmed video and generating a textual description for each event. Many machine-intelligence applications, such as video summarization, search and retrieval, and automatic video subtitling for supporting blind people, benefit from an automated dense caption generator. Most recent works use an encoder-decoder neural network framework that employs a 3D-CNN as an encoder to represent the frames of a detected event and an RNN as a decoder for caption generation. They follow an attention-based mechanism to learn where to focus in the encoded video frames during caption generation. Although attention-based approaches have achieved excellent results, they directly link visual features to textual captions and ignore rich intermediate/high-level video concepts such as people, objects, scenes, and actions. In this paper, we first propose to obtain a better event representation, one that discriminates between events ending at nearly the same time, by applying an attention-based fusion in which hidden states from a bi-directional LSTM sequence video encoder, which encodes the past and future surrounding context of a detected event, are fused with the event's visual (R3D) features. Second, we propose to explicitly extract bi-modal semantic concepts (nouns and verbs) from a detected event segment and to dynamically balance the contributions of the proposed event representation and the semantic concepts during captioning using a gating mechanism. Experimental results demonstrate that the proposed attention-based fusion represents an event better for captioning, and that involving semantic concepts further improves captioning performance.
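
The abstract describes two components: attention-based fusion of bi-directional context states with an event's R3D features, and a gate that balances the fused event representation against bi-modal semantic-concept features during decoding. Below is a minimal PyTorch-style sketch of how such components could look. It is not the authors' implementation; all module names, layer sizes, and the scalar-gate formulation are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's code) of:
# (1) attention-based fusion of bi-LSTM context hidden states with R3D event features,
# (2) a gated decoding step balancing event features and semantic-concept features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveContextFusion(nn.Module):
    """Fuse past/future context hidden states with the event's pooled R3D features."""

    def __init__(self, visual_dim: int, hidden_dim: int, fused_dim: int):
        super().__init__()
        self.score = nn.Linear(visual_dim + 2 * hidden_dim, 1)   # attention scorer
        self.proj = nn.Linear(visual_dim + 2 * hidden_dim, fused_dim)

    def forward(self, r3d_feat, context_states):
        # r3d_feat:       (B, visual_dim)      pooled R3D features of the event
        # context_states: (B, T, 2*hidden_dim) bi-LSTM states over surrounding clips
        query = r3d_feat.unsqueeze(1).expand(-1, context_states.size(1), -1)
        scores = self.score(torch.cat([query, context_states], dim=-1))   # (B, T, 1)
        alpha = F.softmax(scores, dim=1)                                  # attention weights
        context = (alpha * context_states).sum(dim=1)                     # (B, 2*hidden_dim)
        return torch.tanh(self.proj(torch.cat([r3d_feat, context], dim=-1)))


class GatedCaptionStep(nn.Module):
    """One decoding step that gates between event features and semantic concepts."""

    def __init__(self, fused_dim: int, concept_dim: int, embed_dim: int,
                 dec_hidden: int, vocab_size: int):
        super().__init__()
        self.gate = nn.Linear(dec_hidden, 1)          # scalar gate from decoder state
        self.event_proj = nn.Linear(fused_dim, dec_hidden)
        self.concept_proj = nn.Linear(concept_dim, dec_hidden)
        self.rnn = nn.LSTMCell(embed_dim + dec_hidden, dec_hidden)
        self.out = nn.Linear(dec_hidden, vocab_size)

    def forward(self, word_emb, event_repr, concept_feat, state):
        h, c = state
        g = torch.sigmoid(self.gate(h))               # (B, 1), learned balance per step
        mixed = g * self.event_proj(event_repr) + (1 - g) * self.concept_proj(concept_feat)
        h, c = self.rnn(torch.cat([word_emb, mixed], dim=-1), (h, c))
        return self.out(h), (h, c)                    # vocabulary logits, new state
```

In this sketch, `concept_feat` stands in for the bi-modal semantic features (e.g., scores over detected noun and verb concepts for the event segment); how those scores are produced, and the exact form of the gate, are assumptions made only for illustration.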