Structured Encoding Based on Semantic Disambiguation for Video Captioning
Bo Sun, Jinyu Tian, Yong Wu, Lunjun Yu, Yuanyan Tang
Cognitive Computation (journal article), published 2024-05-09. DOI: 10.1007/s12559-024-10275-3 (https://doi.org/10.1007/s12559-024-10275-3)
Impact factor: 4.3. JCR: Q2 (Computer Science, Artificial Intelligence).
Abstract
Video captioning, the task of automatically generating textual descriptions of videos, has attracted significant attention because of its wide range of applications in video surveillance and retrieval. However, most existing methods extract features with frame-level convolution, which ignores the semantic relationships between objects and therefore fails to encode video details. To address this problem, and inspired by how humans perceive the world, we propose a video captioning method based on semantic disambiguation through structured encoding. First, a conceptual semantic graph of the video is constructed by introducing a knowledge graph. Then, graph convolutional networks perform relational learning on the conceptual semantic graph to mine the semantic relationships between objects and form a detail encoding of the video. To resolve the semantic ambiguity that arises when multiple relationships are possible between objects, we propose a method that uses video scene semantics to dynamically learn the most relevant relationships, so that the semantic graphs are constructed after semantic disambiguation. Finally, we propose a cross-domain guided relationship learning strategy to avoid the negative impact of using only captions as supervision for the cross-entropy loss. Experiments on three datasets (MSR-VTT, ActivityNet Captions, and Student Classroom Behavior) show that our method outperforms the compared methods. The results indicate that introducing a knowledge graph for commonsense reasoning about objects in videos makes it possible to encode the semantic relationships between objects in depth, capturing video details and improving captioning performance.
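The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: candidate relation graphs between detected objects are blended with weights derived from a scene vector, and the blended graph is passed through one graph convolution. Everything here is an assumption made for illustration rather than the authors' method: the function names `disambiguate` and `gcn_layer`, the dot-product attention over relation embeddings, the row-normalized propagation rule, and the toy objects and relations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def disambiguate(scene, relation_embeddings, relation_adjacencies):
    """Hypothetical disambiguation step: weight each candidate relation graph
    by the softmax of its embedding's dot product with the scene vector."""
    scores = [sum(s * r for s, r in zip(scene, emb)) for emb in relation_embeddings]
    weights = softmax(scores)
    n = len(relation_adjacencies[0])
    blended = [[sum(w * adj[i][j] for w, adj in zip(weights, relation_adjacencies))
                for j in range(n)] for i in range(n)]
    return weights, blended

def gcn_layer(adjacency, features, weight):
    """One graph convolution with self-loops and row normalization:
    ReLU(D^-1 (A + I) H W). Assumes non-negative edge weights."""
    n = len(adjacency)
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    a_norm = [[v / sum(row) for v in row] for row in a_hat]
    out = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy data (invented): objects 0=person, 1=book, 2=desk.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relation_embeddings = [[1.0, 0.0], [0.0, 1.0]]   # "reads", "sits at"
relation_adjacencies = [
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # person-book
    [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # person-desk
]
scene = [2.0, 0.0]  # a scene vector that favors the "reads" relation

weights, blended = disambiguate(scene, relation_embeddings, relation_adjacencies)
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix
encoded = gcn_layer(blended, features, identity)
```

In this toy setup the scene vector aligns with the first relation embedding, so the person-book edge receives more weight than the person-desk edge before the features are propagated. A real system would learn the embeddings and the weight matrix end to end instead of fixing them by hand.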
About the journal:
Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically-inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a new platform for the dissemination of research, current practices and future trends in the emerging discipline of cognitive computation that bridges the gap between life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.