Jiayu Zhang, Pengjie Tang, Yunlan Tan, Hanli Wang
{"title":"mgr - miss:基于多模态交互和视频描述语义监督的更多地面真相检索","authors":"Jiayu Zhang , Pengjie Tang , Yunlan Tan , Hanli Wang","doi":"10.1016/j.neunet.2025.107817","DOIUrl":null,"url":null,"abstract":"<div><div>Describing a video with accurate words and appropriate sentence patterns is interesting and challenging. Recently, excellent models have been proposed to generate smooth and semantically rich video descriptions. However, the language generally does not participate in encoding training, and different modalities including vision and language cannot be effectively interacted and accurately aligned. In this work, a novel model named MGTR-MISS which consists of multimodal interaction and semantic supervision, is proposed to generate more accurate and semantically rich video descriptions with the help of more ground truth. In detail, more external language knowledge is firstly retrieved from the ground truth corpus in the training set to capture richer linguistic semantics for the video. Then the visual features and retrieved linguistic features are fed into a multimodal interaction module to achieve effective interaction and accurate alignment between modalities. The output multimodal representation is then fed to a caption generator for language decoding with visual-textual attention and semantic supervision mechanisms. Experimental results on the popular MSVD, MSR-VTT and VATEX datasets show that our proposed MGTR-MISS outperforms not only the baseline model but also the recent state-of-the-art methods. 
Particularly, the CIDEr performances reach to 111.1 and 55.0 on MSVD and MSR-VTT respectively.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"192 ","pages":"Article 107817"},"PeriodicalIF":6.3000,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MGTR-MISS: More Ground Truth Retrieving based Multimodal Interaction and Semantic Supervision for video description\",\"authors\":\"Jiayu Zhang , Pengjie Tang , Yunlan Tan , Hanli Wang\",\"doi\":\"10.1016/j.neunet.2025.107817\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Describing a video with accurate words and appropriate sentence patterns is interesting and challenging. Recently, excellent models have been proposed to generate smooth and semantically rich video descriptions. However, the language generally does not participate in encoding training, and different modalities including vision and language cannot be effectively interacted and accurately aligned. In this work, a novel model named MGTR-MISS which consists of multimodal interaction and semantic supervision, is proposed to generate more accurate and semantically rich video descriptions with the help of more ground truth. In detail, more external language knowledge is firstly retrieved from the ground truth corpus in the training set to capture richer linguistic semantics for the video. Then the visual features and retrieved linguistic features are fed into a multimodal interaction module to achieve effective interaction and accurate alignment between modalities. The output multimodal representation is then fed to a caption generator for language decoding with visual-textual attention and semantic supervision mechanisms. 
Experimental results on the popular MSVD, MSR-VTT and VATEX datasets show that our proposed MGTR-MISS outperforms not only the baseline model but also the recent state-of-the-art methods. Particularly, the CIDEr performances reach to 111.1 and 55.0 on MSVD and MSR-VTT respectively.</div></div>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":\"192 \",\"pages\":\"Article 107817\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2025-07-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0893608025006975\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025006975","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
MGTR-MISS: More Ground Truth Retrieving based Multimodal Interaction and Semantic Supervision for video description
Describing a video with accurate words and appropriate sentence patterns is interesting and challenging. Recently, excellent models have been proposed to generate fluent and semantically rich video descriptions. However, language generally does not participate in encoder training, so different modalities, including vision and language, cannot interact effectively or be aligned accurately. In this work, a novel model named MGTR-MISS, built on multimodal interaction and semantic supervision, is proposed to generate more accurate and semantically rich video descriptions with the help of more ground truth. Specifically, external language knowledge is first retrieved from the ground-truth corpus of the training set to capture richer linguistic semantics for the video. The visual features and the retrieved linguistic features are then fed into a multimodal interaction module to achieve effective interaction and accurate alignment between modalities. The resulting multimodal representation is passed to a caption generator for language decoding with visual-textual attention and semantic supervision mechanisms. Experimental results on the popular MSVD, MSR-VTT and VATEX datasets show that the proposed MGTR-MISS outperforms not only the baseline model but also recent state-of-the-art methods. In particular, the CIDEr scores reach 111.1 and 55.0 on MSVD and MSR-VTT, respectively.
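The retrieval step described in the abstract, pulling related ground-truth captions from the training corpus to enrich a video's linguistic semantics, can be sketched with a toy nearest-neighbor search. This is not the authors' implementation: the corpus, the whitespace tokenizer, and the bag-of-words cosine scoring below are simplified stand-ins for whatever representation and similarity measure MGTR-MISS actually uses.

```python
# Illustrative sketch: retrieve the top-k most similar ground-truth
# captions for a query, using bag-of-words cosine similarity.
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query, corpus, k=2):
    """Return the k corpus captions most similar to the query."""
    q = bow(query)
    return sorted(corpus, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]

# Hypothetical miniature ground-truth corpus for demonstration.
corpus = [
    "a man is playing a guitar on stage",
    "a dog is running through the grass",
    "a woman is slicing vegetables in a kitchen",
    "a man plays an acoustic guitar",
]
print(retrieve_top_k("a man playing guitar", corpus, k=2))
```

In the full model, the retrieved captions would then be encoded and fed alongside the visual features into the multimodal interaction module; this sketch only shows the retrieval side.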
Journal Introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.