{"title":"Incorporating the Graph Representation of Video and Text into Video Captioning","authors":"Min Lu, Yuan Li","doi":"10.1109/ICTAI56018.2022.00065","DOIUrl":null,"url":null,"abstract":"Video captioning is to translate the video content into the textual descriptions. In the encoding phase, the existing approaches encode the irrelevant background and uncorrelated visual objects into visual features. That leads to semantic aberration between the visual features and the expected textual caption. In the decoding phase, the word-by-word prediction infers the next word only from the previously generated caption. That local text context is insufficient for word prediction. To tackle the above two issues, the representations of video and text stem from the convolution on two graphs. The convolution on the video graph distills the visual features by filtering the irrelevant background and uncorrelated salient objects. The key issue is to figure out the similar videos according to the video semantic feature. The word graph is constructed to help incorporate global neighborhood among words into word representation. That word global neigh-borhood serves as the global text context and compensates the local text context. Results on two benchmark datasets show the advantage of the proposed method. Experimental analysis is also conducted to verify the effectiveness of the proposed modules.","PeriodicalId":354314,"journal":{"name":"2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICTAI56018.2022.00065","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Video captioning aims to translate video content into textual descriptions. In the encoding phase, existing approaches encode irrelevant background and uncorrelated visual objects into the visual features, which leads to a semantic aberration between the visual features and the expected textual caption. In the decoding phase, word-by-word prediction infers the next word only from the previously generated caption, and this local text context alone is insufficient for word prediction. To tackle these two issues, the representations of video and text are derived from convolution on two graphs. Convolution on the video graph distills the visual features by filtering out the irrelevant background and uncorrelated salient objects; the key issue here is to identify similar videos according to the video semantic feature. The word graph is constructed to incorporate the global neighborhood among words into the word representation. This global word neighborhood serves as the global text context and compensates for the local text context. Results on two benchmark datasets show the advantage of the proposed method, and experimental analysis further verifies the effectiveness of the proposed modules.
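The abstract does not give implementation details, but the word-graph idea rests on standard graph convolution: each word's representation is refined by aggregating features from its neighbors in a word graph. The sketch below is a minimal, hypothetical illustration of one such layer over a small word co-occurrence graph; all variable names, shapes, and the symmetric-normalization choice are assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): one graph-convolution layer
# that injects each word's global neighborhood into its representation.
import numpy as np

def graph_conv(X, A, W):
    """X: node features (N, d_in); A: adjacency (N, N); W: weights (d_in, d_out)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1)                    # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))   # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_norm @ X @ W, 0.0)     # propagate, transform, ReLU

# Toy usage: 4 words with 8-dim embeddings refined into graph-aware features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # initial word embeddings
A = np.array([[0, 1, 1, 0],                    # assumed word co-occurrence links
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = rng.normal(size=(8, 8))
H = graph_conv(X, A, W)                        # (4, 8) graph-contextualized word features
```

Under this reading, the refined word features supply the global text context that is concatenated with, or otherwise combined with, the local context from the previously generated words at decoding time; the paper's video graph would apply the analogous operation over video nodes to suppress irrelevant background and uncorrelated objects.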