{"title":"Video description method with fusion of instance-aware temporal features","authors":"Ju Huang, He Yan, Lingkun Liu, Yuhan Liu","doi":"10.1117/12.3000765","DOIUrl":null,"url":null,"abstract":"There are still challenges in the field of video understanding today, especially how to use natural language to describe the visual content in videos. Existing video encoder-decoder models struggle to extract deep semantic information and effectively understand the complex contextual semantics in a video sequence. Furthermore, different visual elements in the video contribute differently to the generation of video text descriptions. In this paper, we propose a video description method that fuses instance-aware temporal features. We extract local features of instances on the temporal sequence to enhance perception of temporal instances. We also employ spatial attention to perform weighted fusion of temporal features. Finally, we use bidirectional long short-term memory networks to encode the contextual semantic information of the video sequence, thereby helping to generate higher quality descriptive text. Experimental results on two public datasets demonstrate that our method achieves good performance on various evaluation metrics.","PeriodicalId":210802,"journal":{"name":"International Conference on Image Processing and Intelligent Control","volume":"61 4","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Image Processing and Intelligent Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1117/12.3000765","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Video understanding still faces open challenges, particularly in describing the visual content of a video in natural language. Existing encoder-decoder video models struggle to extract deep semantic information and to capture the complex contextual semantics of a video sequence. Moreover, different visual elements in a video contribute unequally to the generated text description. In this paper, we propose a video description method that fuses instance-aware temporal features. We extract local instance features along the temporal sequence to strengthen the perception of temporal instances, and we apply spatial attention to perform a weighted fusion of the temporal features. Finally, a bidirectional long short-term memory network encodes the contextual semantic information of the video sequence, helping to generate higher-quality descriptive text. Experiments on two public datasets show that our method performs well across multiple evaluation metrics.
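The abstract does not give implementation details, but the described pipeline (per-frame instance features, spatial-attention fusion, BiLSTM context encoding) can be sketched minimally as below. All module names, dimensions, and the single-layer attention scorer are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class InstanceAwareTemporalEncoder(nn.Module):
    """Sketch of the described pipeline: spatial attention fuses
    per-frame instance features into one vector per frame, and a
    BiLSTM encodes the fused temporal sequence. Hypothetical design."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        # Assumed scorer: rates each instance feature within a frame.
        self.attn = nn.Linear(feat_dim, 1)
        # Bidirectional LSTM over the fused frame-level sequence.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, inst_feats):
        # inst_feats: (batch, T, N, feat_dim), N instances per frame.
        scores = self.attn(inst_feats)                   # (batch, T, N, 1)
        weights = torch.softmax(scores, dim=2)           # attention over instances
        frame_feats = (weights * inst_feats).sum(dim=2)  # (batch, T, feat_dim)
        context, _ = self.bilstm(frame_feats)            # (batch, T, 2*hidden_dim)
        return context

# Example: 2 videos, 16 frames, 8 detected instances per frame, 512-d features.
x = torch.randn(2, 16, 8, 512)
enc = InstanceAwareTemporalEncoder()
print(enc(x).shape)  # torch.Size([2, 16, 512])
```

In this reading, the attention weights let informative instances dominate each frame's representation before the BiLSTM aggregates context in both temporal directions; a decoder (not shown) would then attend over the returned sequence to generate the description.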