Scene-Edge GRU for Video Caption
Xin Hao, F. Zhou, Xiaoyong Li
2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), June 2020
DOI: 10.1109/ITNEC48623.2020.9084781

Abstract: Recurrent neural networks for video captioning have recently attracted widespread attention; they are essential to the task because they are involved in both the video encoding phase and the text description generation phase. However, traditional encoder-decoder methods ignore scene switching within the video during encoding. In this paper, we propose a video encoding scheme that discovers the scene structure of a video and thereby achieves flexible, variable-length encoding. Unlike the classic encoder-decoder scheme, we propose a new GRU unit that recognizes discontinuities between video frames and enables end-to-end training without additional annotation. We evaluated our approach on two large datasets: the MPII movie description dataset and the MSVD dataset. Experiments show that our method finds an appropriate level of representation for the video and improves on the best previous results on the movie description dataset.
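The abstract describes a GRU unit that detects discontinuities between frames and can be trained end to end without boundary labels, but it does not give the exact equations. The sketch below is one plausible, hypothetical formulation of such a cell: a standard GRU step augmented with a differentiable scalar boundary gate that softly resets the hidden state at likely scene cuts, so each scene is encoded largely independently. All parameter names and the gating scheme here are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SceneEdgeGRU:
    """Illustrative boundary-aware GRU cell (hypothetical formulation).

    A scalar boundary gate b_t is computed from [x_t, h_{t-1}]; when b_t
    is high (a likely scene cut), the previous hidden state is suppressed
    before the usual GRU update. Because the gate is a sigmoid, the whole
    cell stays differentiable and trainable end to end, with no explicit
    scene-boundary annotation required.
    """

    def __init__(self, input_dim, hidden_dim):
        d, h = input_dim, hidden_dim
        # Standard GRU parameters: update gate z, reset gate r, candidate c.
        self.Wz = rng.normal(0, 0.1, (h, d + h)); self.bz = np.zeros(h)
        self.Wr = rng.normal(0, 0.1, (h, d + h)); self.br = np.zeros(h)
        self.Wc = rng.normal(0, 0.1, (h, d + h)); self.bc = np.zeros(h)
        # Boundary detector: a single scalar gate over [x_t, h_{t-1}].
        self.wb = rng.normal(0, 0.1, d + h); self.bb = 0.0

    def step(self, x, h_prev):
        xh = np.concatenate([x, h_prev])
        b = sigmoid(self.wb @ xh + self.bb)   # boundary probability in (0, 1)
        h_in = (1.0 - b) * h_prev             # soft reset at a scene edge
        xh = np.concatenate([x, h_in])
        z = sigmoid(self.Wz @ xh + self.bz)
        r = sigmoid(self.Wr @ xh + self.br)
        c = np.tanh(self.Wc @ np.concatenate([x, r * h_in]) + self.bc)
        h = (1.0 - z) * h_in + z * c          # standard GRU interpolation
        return h, b

# Toy usage: encode 5 random "frame features" of dimension 8.
cell = SceneEdgeGRU(input_dim=8, hidden_dim=16)
h = np.zeros(16)
for x in rng.normal(size=(5, 8)):
    h, b = cell.step(x, h)
print(h.shape)  # (16,)
```

In a full captioning encoder, the boundary probabilities `b` could also be used to pool hidden states per detected scene, yielding the variable-length video representation the abstract refers to.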