{"title":"基于多模态特征的增强视频字幕生成","authors":"Xuefei Huang, Wei Ke, Hao Sheng","doi":"10.1109/UV56588.2022.10185501","DOIUrl":null,"url":null,"abstract":"Video caption is the automatically generated of abstract expressions for the content contained in videos. It involves two important fields — computer vision and natural language processing, and has become a considerable research topic in smart life. Deep learning has successfully contributed to this task with good results. As we know, video contains various modals of information, yet most of the existing solutions start from the visual perspective of video, while ignoring the equally important audio modal information. Therefore, how to benefit from additional forms of cues other than visual information is a huge challenge. In this work, we propose a video caption generation method that fuses multimodal features in videos, and adds attention mechanism to improve the quality of generated description sentences. The experimental results demonstrate that the method is well validated on the MSR-VTT dataset.","PeriodicalId":211011,"journal":{"name":"2022 6th International Conference on Universal Village (UV)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhanced Video Caption Generation Based on Multimodal Features\",\"authors\":\"Xuefei Huang, Wei Ke, Hao Sheng\",\"doi\":\"10.1109/UV56588.2022.10185501\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video caption is the automatically generated of abstract expressions for the content contained in videos. It involves two important fields — computer vision and natural language processing, and has become a considerable research topic in smart life. Deep learning has successfully contributed to this task with good results. 
As we know, video contains various modals of information, yet most of the existing solutions start from the visual perspective of video, while ignoring the equally important audio modal information. Therefore, how to benefit from additional forms of cues other than visual information is a huge challenge. In this work, we propose a video caption generation method that fuses multimodal features in videos, and adds attention mechanism to improve the quality of generated description sentences. The experimental results demonstrate that the method is well validated on the MSR-VTT dataset.\",\"PeriodicalId\":211011,\"journal\":{\"name\":\"2022 6th International Conference on Universal Village (UV)\",\"volume\":\"71 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 6th International Conference on Universal Village (UV)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/UV56588.2022.10185501\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 6th International Conference on Universal Village (UV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/UV56588.2022.10185501","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Enhanced Video Caption Generation Based on Multimodal Features
Video captioning is the automatic generation of abstract textual descriptions of the content contained in videos. It involves two important fields, computer vision and natural language processing, and has become a significant research topic in smart-life applications. Deep learning has contributed successfully to this task with good results. As is well known, video carries information in multiple modalities, yet most existing solutions approach the problem from the visual perspective alone, ignoring the equally important audio modality. How to benefit from cues beyond visual information is therefore a major challenge. In this work, we propose a video caption generation method that fuses multimodal features from videos and adds an attention mechanism to improve the quality of the generated description sentences. Experimental results demonstrate that the method performs well on the MSR-VTT dataset.
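The abstract does not give implementation details of the fusion or attention step. As a minimal sketch of the general idea, the snippet below fuses per-frame visual and audio features with scaled dot-product attention against a decoder query vector; all function names, shapes, and the specific attention formulation are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_and_fuse(visual, audio, query):
    """Fuse visual and audio frame features via dot-product attention.

    visual: (T, d) visual frame features
    audio:  (T, d) audio frame features
    query:  (d,)   decoder hidden state

    Hypothetical formulation: the paper does not specify this exact scheme.
    """
    feats = np.concatenate([visual, audio], axis=0)    # (2T, d) joint pool
    scores = feats @ query / np.sqrt(feats.shape[1])   # scaled dot-product scores
    weights = softmax(scores)                          # (2T,) attention weights
    return weights @ feats                             # (d,) fused context vector

rng = np.random.default_rng(0)
T, d = 8, 16
ctx = attend_and_fuse(rng.normal(size=(T, d)),
                      rng.normal(size=(T, d)),
                      rng.normal(size=(d,)))
print(ctx.shape)  # (16,)
```

In a full captioning model, the fused context vector would be recomputed at every decoding step and fed to the language decoder alongside the previously generated word.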