{"title":"Remote Sensing Image Captioning with Continuous Output Neural Models","authors":"R. Ramos, Bruno Martins","doi":"10.1145/3474717.3483631","url":"https://doi.org/10.1145/3474717.3483631","journal":{"name":"Proceedings of the 29th International Conference on Advances in Geographic Information Systems"},"publicationDate":"2021-11-02","isOpenAccess":false,"citationCount":"2","platform":"Semanticscholar"}
Remote Sensing Image Captioning with Continuous Output Neural Models
Remote sensing image captioning involves generating a concise textual description for an input aerial image. Most previous methods are based on neural encoder-decoder models trained to generate a sequence of discrete outputs with the standard token-level cross-entropy loss. This paper explores an alternative method based on continuous outputs, generating sequences of embedding vectors instead of directly predicting discrete word tokens. We argue that continuous outputs can facilitate the optimization of semantic similarity, as opposed to exact word-by-word matches. They also facilitate the use of loss functions that compare different views of the data. These include comparing representations for individual tokens and for entire captions, and also comparing captions against intermediate image representations. We experimentally compared discrete and continuous output methods on the RSICD dataset, which is widely used in the area. Results show that continuous outputs can indeed lead to better results, and our approach performs competitively with the state-of-the-art model in the area.
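The difference between exact word matching and semantic similarity can be illustrated with a small sketch. The abstract does not specify the loss functions used, so the cosine-distance loss, the nearest-neighbour decoding step, and the toy 2-d embedding values below are all assumptions chosen for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_loss(predicted, target):
    """Mean cosine distance between predicted and reference embeddings.

    Assumed token-level loss: one vector per caption position.
    """
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


def decode(predicted, embedding_table):
    """Map each predicted vector to the most similar vocabulary word."""
    return [
        max(embedding_table, key=lambda w: cosine_similarity(p, embedding_table[w]))
        for p in predicted
    ]


# Hypothetical 2-d word embeddings; real models use pretrained vectors
# with hundreds of dimensions.
EMBEDDINGS = {
    "river": [1.0, 0.0],
    "stream": [0.9, 0.1],
    "airport": [0.0, 1.0],
}

target = [EMBEDDINGS["river"]]
near_miss = [EMBEDDINGS["stream"]]   # a close synonym of the reference word
far_miss = [EMBEDDINGS["airport"]]   # an unrelated word

print(cosine_loss(near_miss, target) < cosine_loss(far_miss, target))
print(decode([[0.95, 0.02]], EMBEDDINGS))
```

Under a token-level cross-entropy loss, predicting "stream" or "airport" in place of "river" counts as the same kind of error. With an embedding-space loss of this sort, the synonym is penalized far less than the unrelated word, which is the intuition behind optimizing semantic similarity rather than exact matches.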