Multi-feature fusion refine network for video captioning

IF 1.7 · CAS Tier 4, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Guangbin Wang, Jixiang Du, Hongbo Zhang
{"title":"Multi-feature fusion refine network for video captioning","authors":"Guangbin Wang, Jixiang Du, Hongbo Zhang","doi":"10.1080/0952813X.2021.1883745","DOIUrl":null,"url":null,"abstract":"ABSTRACT Describing video content using natural language is an important part of video understanding. It needs to not only understand the spatial information on video, but also capture the motion information. Meanwhile, video captioning is a cross-modal problem between vision and language. Traditional video captioning methods follow the encoder-decoder framework that transfers the video to sentence. But the semantic alignment from sentence to video is ignored. Hence, finding a discriminative visual representation as well as narrowing the semantic gap between video and text has great influence on generating accurate sentences. In this paper, we propose an approach based on multi-feature fusion refine network (MFRN), which can not only capture the spatial information and motion information by exploiting multi-feature fusion, but also can get better semantic aligning of different models by designing a refiner to explore the sentence to video stream. The main novelties and advantages of our method are: (1) multi-feature fusion: Both two-dimension convolutional neural networks and three-dimension convolutional neural networks pre-trained on ImageNet and Kinetic respectively are used to construct spatial information and motion information, and then fused to get better visual representation. (2) Sematic alignment refiner: the refiner is designed to restrain the decoder and reproduce the video features to narrow semantic gap between different modal. Experiments on two widely used datasets demonstrate our approach achieves state-of-the-art performance in terms of BLEU@4, METEOR, ROUGE and CIDEr metrics.","PeriodicalId":15677,"journal":{"name":"Journal of Experimental & Theoretical Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":1.7000,"publicationDate":"2021-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Experimental & Theoretical Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1080/0952813X.2021.1883745","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 3

Abstract

Describing video content using natural language is an important part of video understanding. It requires not only understanding the spatial information in a video but also capturing its motion information. At the same time, video captioning is a cross-modal problem between vision and language. Traditional video captioning methods follow the encoder-decoder framework, which translates the video into a sentence, but the semantic alignment from sentence back to video is ignored. Hence, finding a discriminative visual representation and narrowing the semantic gap between video and text have a great influence on generating accurate sentences. In this paper, we propose an approach based on a multi-feature fusion refine network (MFRN), which not only captures spatial and motion information by exploiting multi-feature fusion, but also achieves better semantic alignment by designing a refiner that models the sentence-to-video stream. The main novelties and advantages of our method are: (1) Multi-feature fusion: two-dimensional and three-dimensional convolutional neural networks, pre-trained on ImageNet and Kinetics respectively, are used to extract spatial and motion information, which are then fused to obtain a better visual representation. (2) Semantic alignment refiner: the refiner is designed to constrain the decoder and reproduce the video features, narrowing the semantic gap between the two modalities. Experiments on two widely used datasets demonstrate that our approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR, ROUGE, and CIDEr metrics.
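The abstract describes two mechanisms: fusing appearance features from a 2D CNN with motion features from a 3D CNN, and a refiner that reconstructs the video representation from the decoder's states so the sentence-to-video direction is also supervised. The sketch below illustrates both ideas in PyTorch; the layer sizes, fusion by concatenation plus projection, the mean-pooled video context, the MSE reconstruction loss, and the 0.1 loss weight are illustrative assumptions, not the authors' exact MFRN design.

```python
# Minimal sketch of (1) multi-feature fusion and (2) a sentence-to-video refiner.
# All module shapes and the loss weighting are assumptions for illustration.
import torch
import torch.nn as nn


class FusionEncoder(nn.Module):
    """Fuse per-frame 2D-CNN features with clip-level 3D-CNN features."""

    def __init__(self, dim_2d=2048, dim_3d=1024, dim_fused=512):
        super().__init__()
        self.proj = nn.Linear(dim_2d + dim_3d, dim_fused)

    def forward(self, feat_2d, feat_3d):
        # feat_2d: (batch, steps, dim_2d), feat_3d: (batch, steps, dim_3d)
        fused = torch.cat([feat_2d, feat_3d], dim=-1)
        return torch.tanh(self.proj(fused))            # (batch, steps, dim_fused)


class CaptionDecoder(nn.Module):
    """LSTM decoder conditioned on the mean-pooled fused video representation."""

    def __init__(self, vocab_size, dim_fused=512, dim_hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim_hidden)
        self.lstm = nn.LSTM(dim_hidden + dim_fused, dim_hidden, batch_first=True)
        self.out = nn.Linear(dim_hidden, vocab_size)

    def forward(self, video, tokens):
        # video: (batch, steps, dim_fused); tokens: (batch, length)
        context = video.mean(dim=1, keepdim=True)       # global video context
        words = self.embed(tokens)
        inputs = torch.cat([words, context.expand(-1, words.size(1), -1)], dim=-1)
        hidden, _ = self.lstm(inputs)                   # (batch, length, dim_hidden)
        return self.out(hidden), hidden


class Refiner(nn.Module):
    """Reproduce the video representation from decoder states (sentence -> video)."""

    def __init__(self, dim_hidden=512, dim_fused=512):
        super().__init__()
        self.lstm = nn.LSTM(dim_hidden, dim_fused, batch_first=True)

    def forward(self, decoder_states):
        rebuilt, _ = self.lstm(decoder_states)          # (batch, length, dim_fused)
        return rebuilt.mean(dim=1)                      # one vector per video


if __name__ == "__main__":
    batch, steps, length, vocab = 2, 8, 6, 1000
    encoder, decoder, refiner = FusionEncoder(), CaptionDecoder(vocab), Refiner()
    feat_2d = torch.randn(batch, steps, 2048)           # e.g. ImageNet-pretrained 2D CNN
    feat_3d = torch.randn(batch, steps, 1024)           # e.g. Kinetics-pretrained 3D CNN
    tokens = torch.randint(0, vocab, (batch, length))

    video = encoder(feat_2d, feat_3d)
    logits, states = decoder(video, tokens)
    caption_loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), tokens.reshape(-1))
    # Reconstruction term constrains the decoder from the sentence side.
    recon_loss = nn.functional.mse_loss(refiner(states), video.mean(dim=1))
    loss = caption_loss + 0.1 * recon_loss              # weight is illustrative
    loss.backward()
```

In this sketch the refiner only adds a training-time loss; at inference only the encoder and decoder are needed, which is a common way such reconstruction-style constraints are used.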
Source journal
CiteScore: 6.10
Self-citation rate: 4.50%
Articles per year: 89
Review time: >12 weeks
About the journal: Journal of Experimental & Theoretical Artificial Intelligence (JETAI) is a world-leading journal dedicated to publishing high-quality, rigorously reviewed, original papers in artificial intelligence (AI) research. The journal features work in all subfields of AI research and accepts both theoretical and applied research. Topics covered include, but are not limited to, the following:
• cognitive science
• games
• learning
• knowledge representation
• memory and neural system modelling
• perception
• problem-solving