Multi-feature fusion refine network for video captioning
Authors: Guangbin Wang, Jixiang Du, Hongbo Zhang
Journal: Journal of Experimental & Theoretical Artificial Intelligence
Volume (as listed): 13 1, pages 483-497
Published: 2021-02-23
DOI: 10.1080/0952813X.2021.1883745 (https://doi.org/10.1080/0952813X.2021.1883745)
Impact factor: 1.7; JCR: Q3 (Computer Science, Artificial Intelligence); citations: 3
ABSTRACT Describing video content in natural language is an important part of video understanding. It requires understanding the spatial information in a video as well as capturing its motion information. Video captioning is also a cross-modal problem between vision and language. Traditional video captioning methods follow the encoder-decoder framework, which translates a video into a sentence but ignores the semantic alignment from the sentence back to the video. Hence, finding a discriminative visual representation and narrowing the semantic gap between video and text both strongly influence the accuracy of the generated sentences. In this paper, we propose the multi-feature fusion refine network (MFRN), which captures spatial and motion information through multi-feature fusion and improves semantic alignment between modalities with a refiner that models the sentence-to-video stream. The main novelties and advantages of our method are: (1) Multi-feature fusion: two-dimensional and three-dimensional convolutional neural networks, pre-trained on ImageNet and Kinetics respectively, extract spatial and motion information, which are then fused into a better visual representation. (2) Semantic alignment refiner: the refiner constrains the decoder and reconstructs the video features to narrow the semantic gap between modalities. Experiments on two widely used datasets show that our approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR, ROUGE and CIDEr metrics.
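The two components named in the abstract can be illustrated with a toy sketch. The abstract does not give the fusion operator, the refiner architecture, or the loss, so everything below is an assumption made for illustration: fusion is shown as concatenation followed by a linear projection and tanh, and the refiner as a mean-pooled linear reconstruction scored with mean squared error. All function names, dimensions, and weights are hypothetical, and random vectors stand in for real CNN features.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector (length n) by a matrix with n rows and m columns."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def mean_rows(rows):
    """Average a list of equal-length vectors element-wise."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def fuse_features(spatial, motion, w_fuse):
    """Assumed fusion: concatenate 2D-CNN and 3D-CNN features per step, then project."""
    fused = []
    for s_t, m_t in zip(spatial, motion):
        joint = s_t + m_t  # list concatenation: length d_s + d_m
        fused.append([math.tanh(x) for x in matvec(joint, w_fuse)])
    return fused


def reconstruction_loss(decoder_states, video_features, w_rec):
    """Assumed refiner objective: rebuild the mean video feature from decoder states."""
    rebuilt = matvec(mean_rows(decoder_states), w_rec)
    target = mean_rows(video_features)
    return sum((r - t) ** 2 for r, t in zip(rebuilt, target)) / len(target)


rng = random.Random(0)
steps, words = 8, 5               # video time steps, caption length
d_s, d_m, d_v, d_h = 16, 12, 10, 6  # spatial, motion, fused, decoder sizes

# Random placeholders for features a 2D CNN and a 3D CNN would produce.
spatial = random_matrix(steps, d_s, rng, scale=1.0)
motion = random_matrix(steps, d_m, rng, scale=1.0)
decoder_states = random_matrix(words, d_h, rng, scale=1.0)

w_fuse = random_matrix(d_s + d_m, d_v, rng)
w_rec = random_matrix(d_h, d_v, rng)

video = fuse_features(spatial, motion, w_fuse)
loss = reconstruction_loss(decoder_states, video, w_rec)
print(len(video), len(video[0]), loss >= 0.0)
```

In a setup like this, the reconstruction term would be added to the usual captioning loss so that the decoder states are pushed to retain enough information to recover the visual representation. Whether MFRN combines the losses this way is not stated in the abstract.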
About the journal:
Journal of Experimental & Theoretical Artificial Intelligence (JETAI) is a world-leading journal dedicated to publishing high-quality, rigorously reviewed, original papers in artificial intelligence (AI) research.
The journal features work in all subfields of AI research and accepts both theoretical and applied research. Topics covered include, but are not limited to, the following:
• cognitive science
• games
• learning
• knowledge representation
• memory and neural system modelling
• perception
• problem-solving