Structured Encoding Based on Semantic Disambiguation for Video Captioning

IF 4.3 · CAS Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Cognitive Computation Pub Date : 2024-05-09 DOI:10.1007/s12559-024-10275-3

Bo Sun, Jinyu Tian, Yong Wu, Lunjun Yu, Yuanyan Tang


Citations: 0

Abstract

Video captioning, which aims to automatically generate captions for videos, has gained significant attention due to its wide range of applications in video surveillance and retrieval. However, most existing methods extract features with frame-level convolution, which ignores the semantic relationships between objects and therefore fails to encode video details. To address this problem, and inspired by human cognitive processes, we propose a video captioning method based on semantic disambiguation through structured encoding. First, a conceptual semantic graph of the video is constructed by introducing a knowledge graph. Then, graph convolutional networks perform relational learning on the conceptual semantic graph to mine the semantic relationships between objects and form the detail encoding of the video. To address the semantic ambiguity that arises when multiple relationships hold between objects, we propose a method that uses video scene semantics to dynamically learn the most relevant relationships, so that the semantic graph is constructed on the basis of semantic disambiguation. Finally, we propose a cross-domain guided relationship learning strategy to avoid the negative impact of training with only a caption-based cross-entropy loss. Experiments on three datasets (MSR-VTT, ActivityNet Captions, and Student Classroom Behavior) show that our method outperforms other methods. The results indicate that introducing a knowledge graph for commonsense reasoning about objects in videos can deeply encode the semantic relationships between objects, capturing video details and improving captioning performance.
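The abstract gives no equations, so the following is only a minimal sketch of the general idea it describes: several candidate relation graphs over detected objects are weighted by their relevance to a scene vector, and the blended graph is passed through one graph-convolution step. It is not the authors' implementation. The function names, the softmax relevance scoring, and the symmetric adjacency normalization are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def normalize_adjacency(adj):
    """Add self-loops, then apply symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1)            # >= 1 for non-negative adj, so no division by zero
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def disambiguated_gcn_layer(node_feats, rel_adjs, rel_embs, scene_vec, weight):
    """One graph-convolution step over a scene-weighted blend of relation graphs.

    All arguments are hypothetical stand-ins:
      node_feats: (N, D_in) object features
      rel_adjs:   (R, N, N) one non-negative adjacency matrix per candidate relation type
      rel_embs:   (R, D_s)  an embedding per relation type
      scene_vec:  (D_s,)    a scene-level semantic vector
      weight:     (D_in, D_out) layer weights
    """
    # Assumed disambiguation step: score each relation type against the scene.
    rel_scores = softmax(rel_embs @ scene_vec)               # (R,)
    # Blend the relation graphs into a single adjacency matrix.
    adj = np.tensordot(rel_scores, rel_adjs, axes=1)         # (N, N)
    # Standard GCN propagation followed by ReLU.
    hidden = normalize_adjacency(adj) @ node_feats @ weight  # (N, D_out)
    return np.maximum(hidden, 0.0), rel_scores
```

In this sketch, a relation type whose embedding aligns with the scene vector contributes more edges to the blended graph, which is one simple way to favor the most relevant relationships; a trained model would learn the embeddings and weights rather than take them as fixed inputs.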




Source journal: Cognitive Computation (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-NEUROSCIENCES)

CiteScore: 9.30

Self-citation rate: 3.70%

Articles published: 116

Review time: >12 weeks

About the journal: Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically-inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a new platform for the dissemination of research, current practices and future trends in the emerging discipline of cognitive computation that bridges the gap between life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.