{"title":"Incorporating the Graph Representation of Video and Text into Video Captioning","authors":"Min Lu, Yuan Li","doi":"10.1109/ICTAI56018.2022.00065","DOIUrl":null,"url":null,"abstract":"Video captioning is to translate the video content into the textual descriptions. In the encoding phase, the existing approaches encode the irrelevant background and uncorrelated visual objects into visual features. That leads to semantic aberration between the visual features and the expected textual caption. In the decoding phase, the word-by-word prediction infers the next word only from the previously generated caption. That local text context is insufficient for word prediction. To tackle the above two issues, the representations of video and text stem from the convolution on two graphs. The convolution on the video graph distills the visual features by filtering the irrelevant background and uncorrelated salient objects. The key issue is to figure out the similar videos according to the video semantic feature. The word graph is constructed to help incorporate global neighborhood among words into word representation. That word global neigh-borhood serves as the global text context and compensates the local text context. Results on two benchmark datasets show the advantage of the proposed method. Experimental analysis is also conducted to verify the effectiveness of the proposed modules.","PeriodicalId":354314,"journal":{"name":"2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICTAI56018.2022.00065","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Video captioning aims to translate video content into textual descriptions. In the encoding phase, existing approaches encode irrelevant background and uncorrelated visual objects into the visual features, which leads to a semantic aberration between the visual features and the expected textual caption. In the decoding phase, word-by-word prediction infers the next word only from the previously generated caption, and this local text context alone is insufficient for word prediction. To tackle these two issues, the representations of video and text are derived from convolution on two graphs. Convolution on the video graph distills the visual features by filtering out the irrelevant background and uncorrelated salient objects; the key issue here is to identify similar videos according to the video semantic feature. The word graph is constructed to incorporate the global neighborhood among words into the word representation. This global word neighborhood serves as the global text context and compensates for the local text context. Results on two benchmark datasets show the advantage of the proposed method, and experimental analysis further verifies the effectiveness of the proposed modules.
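The abstract does not give implementation details, but the word-graph idea rests on standard graph convolution: each word's representation is refined by aggregating features from its neighbors in a word graph. The sketch below is a minimal, hypothetical illustration of one such layer over a small word co-occurrence graph; all variable names, shapes, and the symmetric-normalization choice are assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): one graph-convolution layer
# that injects each word's global neighborhood into its representation.
import numpy as np

def graph_conv(X, A, W):
    """X: node features (N, d_in); A: adjacency (N, N); W: weights (d_in, d_out)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1)                    # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))   # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_norm @ X @ W, 0.0)     # propagate, transform, ReLU

# Toy usage: 4 words with 8-dim embeddings refined into graph-aware features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # initial word embeddings
A = np.array([[0, 1, 1, 0],                    # assumed word co-occurrence links
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = rng.normal(size=(8, 8))
H = graph_conv(X, A, W)                        # (4, 8) graph-contextualized word features
```

Under this reading, the refined word features supply the global text context that is concatenated with, or otherwise combined with, the local context from the previously generated words at decoding time; the paper's video graph would apply the analogous operation over video nodes to suppress irrelevant background and uncorrelated objects.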