{"title":"基于G-AoANet的图像描述生成方法研究","authors":"Pi Qiao, Ruixue Shen, Yuan Li","doi":"10.1145/3573942.3574072","DOIUrl":null,"url":null,"abstract":"Most of the image description generation methods in the attention-based encoder-decoder framework extract local features from images. Despite the relatively high semantic level of local features, it still has two problems to be solved, one is object loss, where some important objects may be lost when generating image descriptions, and the other is prediction error, as an object may be identified in the wrong class. In this paper, a G-AoANet model is proposed to solve the above problems. The model uses an attention mechanism to combine global features with local features. In this way, our model can selectively focus on both object and contextual information, improving the quality of the generated descriptions. Experimental results show that the model improves the initially reported best CIDEr-D and SPICE scores on the MS COCO dataset by 9.3% and 5.1% respectively.","PeriodicalId":103293,"journal":{"name":"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Research on Image Description Generation Method Based on G-AoANet\",\"authors\":\"Pi Qiao, Ruixue Shen, Yuan Li\",\"doi\":\"10.1145/3573942.3574072\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Most of the image description generation methods in the attention-based encoder-decoder framework extract local features from images. 
Despite the relatively high semantic level of local features, it still has two problems to be solved, one is object loss, where some important objects may be lost when generating image descriptions, and the other is prediction error, as an object may be identified in the wrong class. In this paper, a G-AoANet model is proposed to solve the above problems. The model uses an attention mechanism to combine global features with local features. In this way, our model can selectively focus on both object and contextual information, improving the quality of the generated descriptions. Experimental results show that the model improves the initially reported best CIDEr-D and SPICE scores on the MS COCO dataset by 9.3% and 5.1% respectively.\",\"PeriodicalId\":103293,\"journal\":{\"name\":\"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3573942.3574072\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern 
Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573942.3574072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Research on Image Description Generation Method Based on G-AoANet
Most image description generation methods within the attention-based encoder-decoder framework extract local features from images. Although local features carry a relatively high semantic level, two problems remain to be solved: object loss, where important objects may be omitted from the generated description, and prediction error, where an object may be assigned to the wrong class. This paper proposes a G-AoANet model to address these problems. The model uses an attention mechanism to combine global features with local features, allowing it to selectively focus on both object and contextual information and improving the quality of the generated descriptions. Experimental results show that the model improves the initially reported best CIDEr-D and SPICE scores on the MS COCO dataset by 9.3% and 5.1%, respectively.
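The fusion idea in the abstract, combining a global image feature with attended local region features, can be sketched as a simple scaled dot-product attention step. This is an illustrative NumPy sketch only, not the authors' actual G-AoANet architecture: the function name `fuse_global_local`, the use of the global vector as the attention query, and the final concatenation are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_global_local(global_feat, local_feats):
    """Illustrative fusion: attend over k local region features using the
    global feature as the query, then concatenate the global feature with
    the attended local context (a hypothetical fusion, not the paper's)."""
    d = global_feat.shape[-1]
    scores = local_feats @ global_feat / np.sqrt(d)  # (k,) similarity scores
    weights = softmax(scores)                        # attention over k regions
    context = weights @ local_feats                  # (d,) weighted local context
    return np.concatenate([global_feat, context])    # (2d,) fused representation

# Toy example: one global vector and 5 local region vectors of dimension 8.
rng = np.random.default_rng(0)
g = rng.standard_normal(8)
L = rng.standard_normal((5, 8))
fused = fuse_global_local(g, L)
print(fused.shape)  # (16,)
```

With this kind of fusion, the decoder sees both scene-level context (reducing object loss) and region-level detail (reducing class-prediction errors), which matches the motivation stated in the abstract.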