Toward Efficient and Accurate Remote Sensing Image–Text Retrieval With a Coarse-to-Fine Approach

Wenqian Zhou; Hanlin Wu; Pei Deng

IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1–5. Published 2024-11-08. DOI: 10.1109/LGRS.2024.3494543
Existing remote sensing (RS) image–text retrieval methods generally fall into two categories: dual-stream and single-stream approaches. Dual-stream models are efficient but often lack sufficient interaction between the visual and textual modalities, while single-stream models offer high accuracy but suffer from long inference times. To strike a tradeoff between efficiency and accuracy, we propose a novel coarse-to-fine image–text retrieval (CFITR) framework that integrates dual-stream and single-stream architectures into a two-stage retrieval process. Our method begins with a dual-stream hashing module (DSHM) that performs coarse retrieval efficiently by leveraging precomputed hash codes. In the subsequent fine retrieval stage, a single-stream module (SSM) refines these results with a joint transformer, improving accuracy through enhanced cross-modal interactions. We further introduce a convolution-based local feature enhancement module (LFEM) to capture detailed local features, and a postprocessing similarity reranking (PPSR) algorithm that optimizes retrieval results without additional training. Extensive experiments on the RSICD and RSITMD datasets demonstrate that our CFITR framework significantly improves retrieval accuracy while supporting real-time performance. Our code is publicly available at https://github.com/ZhWenQian/CFITR.
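The coarse-to-fine idea in the abstract — a cheap Hamming-distance match over precomputed binary hash codes to shortlist candidates, followed by an expensive cross-modal scorer that reranks only that shortlist — can be sketched as below. This is a minimal illustration, not the paper's implementation: `fine_score_fn` is a hypothetical stand-in for the joint-transformer similarity of the SSM, and the hash codes are assumed to be already computed by the dual-stream encoders.

```python
import numpy as np

def hamming_distances(query_code, gallery_codes):
    """Hamming distance between one binary code and a gallery of codes.

    query_code: shape (n_bits,); gallery_codes: shape (n_items, n_bits).
    """
    return (query_code[None, :] != gallery_codes).sum(axis=1)

def coarse_to_fine_retrieval(query_code, gallery_codes, fine_score_fn, k=5):
    """Two-stage retrieval: coarse filtering on hash codes, then fine reranking.

    Stage 1 ranks the whole gallery by Hamming distance (fast, precomputable).
    Stage 2 applies the expensive scorer only to the top-k shortlist and
    returns the shortlist reordered by descending fine score.
    """
    dists = hamming_distances(query_code, gallery_codes)
    shortlist = np.argsort(dists)[:k]                 # coarse stage
    scores = [fine_score_fn(i) for i in shortlist]    # fine stage, k calls only
    order = np.argsort(scores)[::-1]                  # best fine score first
    return [int(shortlist[j]) for j in order]
```

The efficiency gain comes from calling the fine scorer k times instead of once per gallery item; the accuracy comes from letting that scorer decide the final ordering within the shortlist.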