Video description method with fusion of instance-aware temporal features

Ju Huang, He Yan, Lingkun Liu, Yuhan Liu
DOI: 10.1117/12.3000765 (https://doi.org/10.1117/12.3000765)
Published in: International Conference on Image Processing and Intelligent Control, 2023-08-09

Abstract

Video understanding still poses challenges, especially in describing the visual content of a video with natural language. Existing video encoder-decoder models struggle to extract deep semantic information and to capture the complex contextual semantics of a video sequence. Moreover, different visual elements in a video contribute unequally to the generated text description. In this paper, we propose a video description method that fuses instance-aware temporal features. We extract local instance features along the temporal sequence to enhance the perception of temporal instances, and we apply spatial attention to perform a weighted fusion of the temporal features. Finally, a bidirectional long short-term memory (BiLSTM) network encodes the contextual semantic information of the video sequence, helping to generate higher-quality descriptive text. Experimental results on two public datasets demonstrate that our method performs well on various evaluation metrics.
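The abstract describes a pipeline of three stages: per-frame instance features, spatial-attention weighted fusion across instances, and BiLSTM encoding over time. A minimal sketch of that fusion-plus-encoding step is below; the module name, dimensions, and layer choices are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch (not the authors' implementation): spatial attention scores
# each instance within a frame, the scores weight a fusion of instance
# features, and a BiLSTM then encodes the fused frame sequence.
import torch
import torch.nn as nn


class InstanceTemporalFusion(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # One scalar score per instance feature; softmax over instances
        # turns the scores into spatial attention weights.
        self.attn = nn.Linear(feat_dim, 1)
        # Bidirectional LSTM encodes contextual semantics along time.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, inst_feats: torch.Tensor) -> torch.Tensor:
        # inst_feats: (batch, time, num_instances, feat_dim)
        scores = self.attn(inst_feats)              # (B, T, N, 1)
        weights = torch.softmax(scores, dim=2)      # attention over instances
        fused = (weights * inst_feats).sum(dim=2)   # (B, T, feat_dim)
        encoded, _ = self.bilstm(fused)             # (B, T, 2 * hidden_dim)
        return encoded


if __name__ == "__main__":
    model = InstanceTemporalFusion()
    clips = torch.randn(2, 8, 5, 512)  # 2 clips, 8 frames, 5 instances/frame
    out = model(clips)
    print(tuple(out.shape))
```

The BiLSTM output doubles the hidden size (forward and backward passes concatenated), so a downstream caption decoder would attend over `(B, T, 512)` context vectors in this configuration.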