Deep Cross-Modal Retrieval for Remote Sensing Image and Audio

Gou Mao, Yuan Yuan, Lu Xiaoqiang
{"title":"遥感图像和音频的深度跨模态检索","authors":"Gou Mao, Yuan Yuan, Lu Xiaoqiang","doi":"10.1109/PRRS.2018.8486338","DOIUrl":null,"url":null,"abstract":"Remote sensing image retrieval has many important applications in civilian and military fields, such as disaster monitoring and target detecting. However, the existing research on image retrieval, mainly including to two directions, text based and content based, cannot meet the rapid and convenient needs of some special applications and emergency scenes. Based on text, the retrieval is limited by keyboard inputting because of its lower efficiency for some urgent situations and based on content, it needs an example image as reference, which usually does not exist. Yet speech, as a direct, natural and efficient human-machine interactive way, can make up these shortcomings. Hence, a novel cross-modal retrieval method for remote sensing image and spoken audio is proposed in this paper. We first build a large-scale remote sensing image dataset with plenty of manual annotated spoken audio captions for the cross-modal retrieval task. Then a Deep Visual-Audio Network is designed to directly learn the correspondence of image and audio. And this model integrates feature extracting and multi-modal learning into the same network. Experiments on the proposed dataset verify the effectiveness of our approach and prove that it is feasible for speech-to-image retrieval.","PeriodicalId":197319,"journal":{"name":"2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":"{\"title\":\"Deep Cross-Modal Retrieval for Remote Sensing Image and Audio\",\"authors\":\"Gou Mao, Yuan Yuan, Lu Xiaoqiang\",\"doi\":\"10.1109/PRRS.2018.8486338\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Remote sensing image retrieval has many important applications in civilian and military fields, such as disaster monitoring and target detecting. However, the existing research on image retrieval, mainly including to two directions, text based and content based, cannot meet the rapid and convenient needs of some special applications and emergency scenes. Based on text, the retrieval is limited by keyboard inputting because of its lower efficiency for some urgent situations and based on content, it needs an example image as reference, which usually does not exist. Yet speech, as a direct, natural and efficient human-machine interactive way, can make up these shortcomings. Hence, a novel cross-modal retrieval method for remote sensing image and spoken audio is proposed in this paper. We first build a large-scale remote sensing image dataset with plenty of manual annotated spoken audio captions for the cross-modal retrieval task. Then a Deep Visual-Audio Network is designed to directly learn the correspondence of image and audio. And this model integrates feature extracting and multi-modal learning into the same network. 
Experiments on the proposed dataset verify the effectiveness of our approach and prove that it is feasible for speech-to-image retrieval.\",\"PeriodicalId\":197319,\"journal\":{\"name\":\"2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS)\",\"volume\":\"39 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"43\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PRRS.2018.8486338\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PRRS.2018.8486338","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 43

Abstract

Remote sensing image retrieval has many important applications in civilian and military fields, such as disaster monitoring and target detection. However, existing research on image retrieval, which falls mainly into two directions, text-based and content-based, cannot meet the need for rapid and convenient retrieval in some special applications and emergency scenarios. Text-based retrieval is limited by keyboard input, which is too slow in urgent situations, while content-based retrieval requires an example image as a reference, which usually does not exist. Speech, as a direct, natural, and efficient form of human-machine interaction, can make up for these shortcomings. Hence, a novel cross-modal retrieval method for remote sensing images and spoken audio is proposed in this paper. We first build a large-scale remote sensing image dataset with a large number of manually annotated spoken audio captions for the cross-modal retrieval task. A Deep Visual-Audio Network is then designed to directly learn the correspondence between images and audio; this model integrates feature extraction and multi-modal learning into the same network. Experiments on the proposed dataset verify the effectiveness of our approach and show that speech-to-image retrieval is feasible.
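The abstract does not spell out the internals of the Deep Visual-Audio Network, but the core idea it describes, two modality-specific encoders trained jointly so that an image and its spoken caption land close together in a shared embedding space, can be sketched in a few lines of PyTorch. Everything below (layer sizes, the log-mel spectrogram input, the triplet-style ranking loss, and names such as VisualBranch and AudioBranch) is an assumption made for illustration, not the authors' actual architecture.

```python
# Minimal, illustrative sketch of a two-branch visual-audio embedding network.
# The exact DVAN design is not given in the abstract; all sizes, modules, and
# the loss below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualBranch(nn.Module):
    """Small CNN mapping an RGB remote sensing image to a unit-norm embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return F.normalize(self.proj(h), dim=-1)


class AudioBranch(nn.Module):
    """1-D CNN over a spectrogram (freq x time) of the spoken caption."""
    def __init__(self, n_mels=40, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mels, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return F.normalize(self.proj(h), dim=-1)


def correspondence_loss(img_emb, aud_emb, margin=0.2):
    """Bidirectional triplet-style ranking loss using in-batch negatives."""
    sim = img_emb @ aud_emb.t()              # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # similarity of matching pairs
    cost_a = F.relu(margin + sim - pos)      # image paired with wrong audio
    cost_i = F.relu(margin + sim - pos.t())  # audio paired with wrong image
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_a.masked_fill(mask, 0).mean() + cost_i.masked_fill(mask, 0).mean()


if __name__ == "__main__":
    imgs = torch.randn(8, 3, 128, 128)       # dummy image batch
    specs = torch.randn(8, 40, 200)          # dummy log-mel spectrograms
    v, a = VisualBranch(), AudioBranch()
    loss = correspondence_loss(v(imgs), a(specs))
    loss.backward()
    print(loss.item())
```

At retrieval time, a speech-to-image query would be embedded with the audio branch and database images ranked by cosine similarity against their precomputed visual embeddings.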