Implicit and explicit commonsense for multi-sentence video captioning

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding Pub Date : 2024-07-05 DOI:10.1016/j.cviu.2024.104064

{"title":"Implicit and explicit commonsense for multi-sentence video captioning","authors":"","doi":"10.1016/j.cviu.2024.104064","DOIUrl":null,"url":null,"abstract":"<div><p>Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about progression of events, causality, and even the function of certain objects within a scene. To address this limitation we propose a novel video captioning Transformer-based model, that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of its performance. We formalize the task using the ALFRED dataset generated using an AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in the fact that it exhibits stronger object persistence, as well as spatially-aware and causal sentence structure. We show that our commonsense knowledge enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as the state-of-the-art result on more traditional video captioning in the ActivityNet Captions dataset.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224001450/pdfft?md5=10aaeba9fc35361fc930cce1f7a01fe8&pid=1-s2.0-S1077314224001450-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314224001450","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about progression of events, causality, and even the function of certain objects within a scene. To address this limitation we propose a novel video captioning Transformer-based model, that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of its performance. We formalize the task using the ALFRED dataset generated using an AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in the fact that it exhibits stronger object persistence, as well as spatially-aware and causal sentence structure. We show that our commonsense knowledge enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as the state-of-the-art result on more traditional video captioning in the ActivityNet Captions dataset.

查看原文本刊更多论文

多句式视频字幕的隐性和显性常识

现有的密集或段落视频字幕方法依赖于视频的整体表征，并可能与学习到的物体/动作表征相结合，为分层语言解码器提供条件。然而，这些方法从根本上缺乏推理事件进展、因果关系，甚至场景中某些物体的功能所需的常识性知识。为了解决这一局限性，我们提出了一种基于转换器的新型视频字幕模型，该模型同时考虑了隐性（视觉语言和纯语言）和显性（知识库）常识知识。我们的研究表明，这些知识形式无论是单独使用还是结合使用，都能提高字幕的质量。此外，受模仿学习的启发，我们提出了一项新的指令生成任务，其目标是从视频演示中生成一组语言指令。我们利用在 AI2-THOR 环境中生成的 ALFRED 数据集正式确定了这一任务。虽然指令生成在概念上与段落标题相似，但其不同之处在于它表现出更强的对象持久性，以及空间感知和因果关系句子结构。我们的研究表明，我们的常识性知识增强方法在这项任务中取得了显著的改进（在 METEOR 中达到了 57%，在 CIDEr 中达到了 8.5%），在 ActivityNet Captions 数据集中的更传统的视频字幕上也取得了最先进的结果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Computer Vision and Image Understanding 工程技术-工程：电子与电气

CiteScore

7.80

自引率

4.40%

发文量

112

审稿时长

79 days

期刊介绍： The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research Areas Include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems