Adaptive semantic guidance network for video captioning

IF 4.3 · CAS Tier 3 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Yuanyuan Liu, Hong Zhu, Zhong Wu, Sen Du, Shuning Wu, Jing Shi
{"title":"Adaptive semantic guidance network for video captioning","authors":"Yuanyuan Liu ,&nbsp;Hong Zhu ,&nbsp;Zhong Wu ,&nbsp;Sen Du ,&nbsp;Shuning Wu ,&nbsp;Jing Shi","doi":"10.1016/j.cviu.2024.104255","DOIUrl":null,"url":null,"abstract":"<div><div>Video captioning aims to describe video content using natural language, and effectively integrating information of visual and textual is crucial for generating accurate captions. However, we find that the existing methods over-rely on the language-prior information about the text acquired by training, resulting in the model tending to output high-frequency fixed phrases. In order to solve the above problems, we extract high-quality semantic information from multi-modal input and then build a semantic guidance mechanism to adapt to the contribution of visual semantics and text semantics to generate captions. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information from visual and textual. The ACD dynamically adjusts the contribution weights of semantics about visual and textual for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome the problem of over-reliance on language priors, resulting in more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets. MSVD and MSR-VTT reached the state-of-the-art, and YouCookII also achieved good performance. These experiments fully verified the advantages of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104255"},"PeriodicalIF":4.3000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314224003369","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Video captioning aims to describe video content in natural language, and effectively integrating visual and textual information is crucial for generating accurate captions. However, we find that existing methods over-rely on language-prior information acquired during training, so the model tends to output high-frequency fixed phrases. To address this problem, we extract high-quality semantic information from the multi-modal input and build a semantic guidance mechanism that adaptively balances the contributions of visual and textual semantics during caption generation. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploiting the rich semantic information in both the visual and textual modalities. The ACD dynamically adjusts the contribution weights of visual and textual semantics for word generation, guiding the model to adaptively focus on the correct semantic information. Together, these two modules help the model overcome over-reliance on language priors, yielding more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets: our method achieves state-of-the-art results on MSVD and MSR-VTT, and it also performs well on YouCookII. These experiments verify the advantages of our method.
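The adaptive weighting idea behind the ACD can be illustrated with a small gating sketch: at each decoding step, a learned gate conditioned on the decoder state decides how much the next word should rely on visual versus textual semantics. The sketch below is an illustrative assumption based only on the abstract; the module name GatedSemanticFusion, the feature dimension, and the gate design are hypothetical, not the authors' implementation.

import torch
import torch.nn as nn

class GatedSemanticFusion(nn.Module):
    """Minimal sketch of adaptive semantic weighting (not the paper's code).

    A sigmoid gate mixes visual and textual semantic vectors per decoding
    step, so the word predictor is not driven by language priors alone.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Gate conditioned on the decoder state and both semantic vectors.
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())

    def forward(self, hidden, vis_sem, txt_sem):
        # hidden, vis_sem, txt_sem: (batch, dim)
        g = self.gate(torch.cat([hidden, vis_sem, txt_sem], dim=-1))
        # Convex-style mix: g weights visual semantics, (1 - g) textual.
        return g * vis_sem + (1.0 - g) * txt_sem

if __name__ == "__main__":
    fusion = GatedSemanticFusion(dim=512)
    h = torch.randn(2, 512)  # decoder hidden state
    v = torch.randn(2, 512)  # visual semantic features
    t = torch.randn(2, 512)  # textual semantic features
    fused = fusion(h, v, t)  # (2, 512), would feed the word classifier
    print(fused.shape)

Under this reading, when the gate saturates toward the textual side the model falls back on language priors, so pushing weight toward grounded visual semantics is what counteracts high-frequency fixed phrases.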
Source Journal
Computer Vision and Image Understanding (CAS category: Engineering Technology – Electrical & Electronic Engineering)
CiteScore: 7.80
Self-citation rate: 4.40%
Articles per year: 112
Review time: 79 days
Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems