Multi-feature fusion refine network for video captioning

IF 1.7 · CAS Tier 4, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Guangbin Wang, Jixiang Du, Hongbo Zhang
{"title":"Multi-feature fusion refine network for video captioning","authors":"Guangbin Wang, Jixiang Du, Hongbo Zhang","doi":"10.1080/0952813X.2021.1883745","DOIUrl":null,"url":null,"abstract":"ABSTRACT Describing video content using natural language is an important part of video understanding. It needs to not only understand the spatial information on video, but also capture the motion information. Meanwhile, video captioning is a cross-modal problem between vision and language. Traditional video captioning methods follow the encoder-decoder framework that transfers the video to sentence. But the semantic alignment from sentence to video is ignored. Hence, finding a discriminative visual representation as well as narrowing the semantic gap between video and text has great influence on generating accurate sentences. In this paper, we propose an approach based on multi-feature fusion refine network (MFRN), which can not only capture the spatial information and motion information by exploiting multi-feature fusion, but also can get better semantic aligning of different models by designing a refiner to explore the sentence to video stream. The main novelties and advantages of our method are: (1) multi-feature fusion: Both two-dimension convolutional neural networks and three-dimension convolutional neural networks pre-trained on ImageNet and Kinetic respectively are used to construct spatial information and motion information, and then fused to get better visual representation. (2) Sematic alignment refiner: the refiner is designed to restrain the decoder and reproduce the video features to narrow semantic gap between different modal. Experiments on two widely used datasets demonstrate our approach achieves state-of-the-art performance in terms of BLEU@4, METEOR, ROUGE and CIDEr metrics.","PeriodicalId":15677,"journal":{"name":"Journal of Experimental & Theoretical Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":1.7000,"publicationDate":"2021-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Experimental & Theoretical Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1080/0952813X.2021.1883745","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 3

Abstract

Describing video content using natural language is an important part of video understanding. It requires not only understanding the spatial information in a video but also capturing its motion information. At the same time, video captioning is a cross-modal problem between vision and language. Traditional video captioning methods follow the encoder-decoder framework, which translates the video into a sentence, but the semantic alignment from sentence back to video is ignored. Hence, finding a discriminative visual representation and narrowing the semantic gap between video and text have a great influence on generating accurate sentences. In this paper, we propose an approach based on a multi-feature fusion refine network (MFRN), which not only captures spatial and motion information by exploiting multi-feature fusion, but also achieves better semantic alignment by designing a refiner that models the sentence-to-video stream. The main novelties and advantages of our method are: (1) Multi-feature fusion: two-dimensional and three-dimensional convolutional neural networks, pre-trained on ImageNet and Kinetics respectively, are used to extract spatial and motion information, which are then fused to obtain a better visual representation. (2) Semantic alignment refiner: the refiner is designed to constrain the decoder and reproduce the video features, narrowing the semantic gap between the two modalities. Experiments on two widely used datasets demonstrate that our approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR, ROUGE, and CIDEr metrics.
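The abstract describes two mechanisms: fusing appearance features from a 2D CNN with motion features from a 3D CNN, and a refiner that reconstructs the video representation from the decoder's states so the sentence-to-video direction is also supervised. The sketch below illustrates both ideas in PyTorch; the layer sizes, fusion by concatenation plus projection, the mean-pooled video context, the MSE reconstruction loss, and the 0.1 loss weight are illustrative assumptions, not the authors' exact MFRN design.

```python
# Minimal sketch of (1) multi-feature fusion and (2) a sentence-to-video refiner.
# All module shapes and the loss weighting are assumptions for illustration.
import torch
import torch.nn as nn


class FusionEncoder(nn.Module):
    """Fuse per-frame 2D-CNN features with clip-level 3D-CNN features."""

    def __init__(self, dim_2d=2048, dim_3d=1024, dim_fused=512):
        super().__init__()
        self.proj = nn.Linear(dim_2d + dim_3d, dim_fused)

    def forward(self, feat_2d, feat_3d):
        # feat_2d: (batch, steps, dim_2d), feat_3d: (batch, steps, dim_3d)
        fused = torch.cat([feat_2d, feat_3d], dim=-1)
        return torch.tanh(self.proj(fused))            # (batch, steps, dim_fused)


class CaptionDecoder(nn.Module):
    """LSTM decoder conditioned on the mean-pooled fused video representation."""

    def __init__(self, vocab_size, dim_fused=512, dim_hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim_hidden)
        self.lstm = nn.LSTM(dim_hidden + dim_fused, dim_hidden, batch_first=True)
        self.out = nn.Linear(dim_hidden, vocab_size)

    def forward(self, video, tokens):
        # video: (batch, steps, dim_fused); tokens: (batch, length)
        context = video.mean(dim=1, keepdim=True)       # global video context
        words = self.embed(tokens)
        inputs = torch.cat([words, context.expand(-1, words.size(1), -1)], dim=-1)
        hidden, _ = self.lstm(inputs)                   # (batch, length, dim_hidden)
        return self.out(hidden), hidden


class Refiner(nn.Module):
    """Reproduce the video representation from decoder states (sentence -> video)."""

    def __init__(self, dim_hidden=512, dim_fused=512):
        super().__init__()
        self.lstm = nn.LSTM(dim_hidden, dim_fused, batch_first=True)

    def forward(self, decoder_states):
        rebuilt, _ = self.lstm(decoder_states)          # (batch, length, dim_fused)
        return rebuilt.mean(dim=1)                      # one vector per video


if __name__ == "__main__":
    batch, steps, length, vocab = 2, 8, 6, 1000
    encoder, decoder, refiner = FusionEncoder(), CaptionDecoder(vocab), Refiner()
    feat_2d = torch.randn(batch, steps, 2048)           # e.g. ImageNet-pretrained 2D CNN
    feat_3d = torch.randn(batch, steps, 1024)           # e.g. Kinetics-pretrained 3D CNN
    tokens = torch.randint(0, vocab, (batch, length))

    video = encoder(feat_2d, feat_3d)
    logits, states = decoder(video, tokens)
    caption_loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), tokens.reshape(-1))
    # Reconstruction term constrains the decoder from the sentence side.
    recon_loss = nn.functional.mse_loss(refiner(states), video.mean(dim=1))
    loss = caption_loss + 0.1 * recon_loss              # weight is illustrative
    loss.backward()
```

In this sketch the refiner only adds a training-time loss; at inference only the encoder and decoder are needed, which is a common way such reconstruction-style constraints are used.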
Source journal
CiteScore: 6.10
Self-citation rate: 4.50%
Articles per year: 89
Review time: >12 weeks
About the journal: Journal of Experimental & Theoretical Artificial Intelligence (JETAI) is a world-leading journal dedicated to publishing high-quality, rigorously reviewed, original papers in artificial intelligence (AI) research. The journal features work in all subfields of AI research and accepts both theoretical and applied research. Topics covered include, but are not limited to, the following:
• cognitive science
• games
• learning
• knowledge representation
• memory and neural system modelling
• perception
• problem-solving