Structured Encoding Based on Semantic Disambiguation for Video Captioning

IF 4.3; Zone 3 (Computer Science); JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Bo Sun, Jinyu Tian, Yong Wu, Lunjun Yu, Yuanyan Tang
Journal: Cognitive Computation
Publication type: Journal Article
Published: 2024-05-09
DOI: 10.1007/s12559-024-10275-3 (https://doi.org/10.1007/s12559-024-10275-3)
Citations: 0

Abstract

Video captioning, the automatic generation of textual descriptions for videos, has attracted significant attention because of its wide range of applications in video surveillance and retrieval. However, most existing methods extract features with frame-level convolution, which ignores the semantic relationships between objects and therefore fails to encode video details. To address this problem, and inspired by how humans perceive the world, we propose a video captioning method based on semantic disambiguation through structured encoding. First, a conceptual semantic graph of the video is constructed by introducing a knowledge graph. Then, graph convolutional networks perform relational learning on the conceptual semantic graph to mine the semantic relationships between objects and form a detail encoding of the video. To resolve the semantic ambiguity that arises when several relationships are possible between objects, we propose a method that uses video scene semantics to dynamically learn the most relevant relationships, so that the semantic graph is constructed on the basis of semantic disambiguation. Finally, we propose a cross-domain guided relationship learning strategy to avoid the negative impact of relying only on captions for the cross-entropy loss. Experiments on three datasets (MSR-VTT, ActivityNet Captions, and Student Classroom Behavior) show that our method outperforms other methods. The results indicate that introducing a knowledge graph for common-sense reasoning about objects in videos makes it possible to deeply encode the semantic relationships between objects, capture video details, and improve captioning performance.
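The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, under stated assumptions: candidate relations between detected objects (as might be looked up in a knowledge graph) are scored against a scene vector, the most relevant relation is kept, and one graph-convolution step propagates object features over the resulting graph. All names, vectors, and the row-normalised aggregation are illustrative choices, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def disambiguate(candidates, scene):
    """Pick the candidate relation whose (hypothetical) embedding best
    matches the scene vector; return its name and softmax weight."""
    names = list(candidates)
    weights = softmax([dot(candidates[name], scene) for name in names])
    best = max(range(len(names)), key=lambda i: weights[i])
    return names[best], weights[best]


def build_adjacency(num_nodes, edge_candidates, scene):
    """Build a symmetric adjacency matrix with self-loops, weighting each
    edge by the confidence of its disambiguated relation."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    chosen = {}
    for (i, j), candidates in edge_candidates.items():
        relation, weight = disambiguate(candidates, scene)
        adj[i][j] = adj[j][i] = weight
        chosen[(i, j)] = relation
    return adj, chosen


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1 * A * H * W).
    Row normalisation is an assumed simplification."""
    n, d_in, d_out = len(feats), len(weight), len(weight[0])
    out = []
    for i in range(n):
        degree = sum(adj[i])
        agg = [sum(adj[i][j] * feats[j][k] for j in range(n)) / degree
               for k in range(d_in)]
        out.append([max(0.0, sum(agg[k] * weight[k][o] for k in range(d_in)))
                    for o in range(d_out)])
    return out


# Toy example: node 0 = "person", node 1 = "horse" (made-up data).
edge_candidates = {(0, 1): {"rides": [1.0, 0.0], "feeds": [0.0, 1.0]}}
scene = [2.0, 0.5]  # hypothetical scene embedding
adj, chosen = build_adjacency(2, edge_candidates, scene)
feats = [[1.0, 0.0], [0.0, 1.0]]     # one-hot object features
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix
encoded = gcn_layer(adj, feats, identity)
print(chosen)   # {(0, 1): 'rides'}
print(encoded)
```

In this toy setup the scene vector favours "rides" over "feeds", so only that relation contributes an edge, and each object's encoding becomes a mixture of its own features and its neighbour's. A trained model would learn the relation embeddings, scene representation, and weight matrix rather than fixing them by hand.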


Source journal
Cognitive Computation (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; NEUROSCIENCES)
CiteScore: 9.30
Self-citation rate: 3.70%
Articles published: 116
Review time: >12 weeks
About the journal: Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically-inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a new platform for the dissemination of research, current practices and future trends in the emerging discipline of cognitive computation that bridges the gap between life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.