Dongyang Liu , Xuejian Liang , Yunxiao Qi , Yunqiao Xi , Jing Jin , Junping Zhang
{"title":"VLPRSDet:用于遥感目标检测的视觉语言预训练模型","authors":"Dongyang Liu , Xuejian Liang , Yunxiao Qi , Yunqiao Xi , Jing Jin , Junping Zhang","doi":"10.1016/j.neucom.2025.131712","DOIUrl":null,"url":null,"abstract":"<div><div>Recently, numerous excellent vision-language models have emerged in the field of computer vision. These models have demonstrated strong zero-shot detection capabilities and better accuracy after fine-tuning on new datasets in the field of object detection. However, when these models are directly applied to the field of remote sensing, their performance is less than satisfactory. To address this problem, a novel vision-language pretrained model specifically tailored for remote sensing object detection task is proposed. Firstly, we create a new dataset composed of object-text pairs by collecting a large amount of remote sensing image object detection data to train the proposed model. Then, by integrating the CLIP model in the field of remote sensing with the YOLO detector, we propose a vision-language pretrained model for remote sensing object detection (VLPRSDet). VLPRSDet achieves enhanced fusion of visual and textual features through a vision language path aggregation network, and then aligns visual embeddings and textual embeddings through Region Text Matching to achieve the alignment between object regions and text. Experimental results indicate that the proposed VLPRSDet exhibits robust zero-shot capabilities in the field of remote sensing object detection, and can achieve superior detection accuracy after fine-tuning on specific datasets. Specifically, after fine-tuning, VLPRSDet can achieve 76.2 % mAP on the DIOR dataset and 94.2 % mAP on the HRRSD dataset. 
The code and dataset will be released at <span><span>https://github.com/dyl96/VLPRSDet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"658 ","pages":"Article 131712"},"PeriodicalIF":6.5000,"publicationDate":"2025-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VLPRSDet: A vision–language pretrained model for remote sensing object detection\",\"authors\":\"Dongyang Liu , Xuejian Liang , Yunxiao Qi , Yunqiao Xi , Jing Jin , Junping Zhang\",\"doi\":\"10.1016/j.neucom.2025.131712\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Recently, numerous excellent vision-language models have emerged in the field of computer vision. These models have demonstrated strong zero-shot detection capabilities and better accuracy after fine-tuning on new datasets in the field of object detection. However, when these models are directly applied to the field of remote sensing, their performance is less than satisfactory. To address this problem, a novel vision-language pretrained model specifically tailored for remote sensing object detection task is proposed. Firstly, we create a new dataset composed of object-text pairs by collecting a large amount of remote sensing image object detection data to train the proposed model. Then, by integrating the CLIP model in the field of remote sensing with the YOLO detector, we propose a vision-language pretrained model for remote sensing object detection (VLPRSDet). VLPRSDet achieves enhanced fusion of visual and textual features through a vision language path aggregation network, and then aligns visual embeddings and textual embeddings through Region Text Matching to achieve the alignment between object regions and text. 
Experimental results indicate that the proposed VLPRSDet exhibits robust zero-shot capabilities in the field of remote sensing object detection, and can achieve superior detection accuracy after fine-tuning on specific datasets. Specifically, after fine-tuning, VLPRSDet can achieve 76.2 % mAP on the DIOR dataset and 94.2 % mAP on the HRRSD dataset. The code and dataset will be released at <span><span>https://github.com/dyl96/VLPRSDet</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"658 \",\"pages\":\"Article 131712\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2025-10-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231225023847\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225023847","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
VLPRSDet: A vision–language pretrained model for remote sensing object detection
Recently, numerous strong vision-language models have emerged in computer vision. In object detection, these models have demonstrated strong zero-shot capabilities and improved accuracy after fine-tuning on new datasets. However, when applied directly to remote sensing imagery, their performance is unsatisfactory. To address this problem, a novel vision-language pretrained model tailored for remote sensing object detection is proposed. First, we build a new dataset of object-text pairs by collecting a large amount of remote sensing object detection data to train the proposed model. Then, by integrating a remote sensing CLIP model with the YOLO detector, we propose a vision-language pretrained model for remote sensing object detection (VLPRSDet). VLPRSDet fuses visual and textual features through a vision-language path aggregation network, and then aligns visual embeddings with textual embeddings via Region Text Matching, linking object regions to text. Experimental results indicate that VLPRSDet exhibits robust zero-shot capability in remote sensing object detection and achieves superior accuracy after fine-tuning on specific datasets: 76.2% mAP on the DIOR dataset and 94.2% mAP on the HRRSD dataset. The code and dataset will be released at https://github.com/dyl96/VLPRSDet.
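The core of region-text alignment as described above is scoring each candidate object region against text embeddings of category prompts. The sketch below is illustrative only, not the paper's implementation: the function name, temperature value, and toy embeddings are assumptions, and it shows only the generic CLIP-style cosine-similarity matching that Region Text Matching builds on.

```python
import numpy as np

def region_text_similarity(region_feats, text_feats, temperature=0.07):
    """Temperature-scaled cosine similarity between region and text embeddings.

    region_feats: (R, D) visual embeddings of candidate object regions
    text_feats:   (C, D) text embeddings of category prompts
    Returns an (R, C) score matrix; each region's best-matching category
    is the argmax of its row.
    """
    # L2-normalize rows so the dot product equals cosine similarity
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return (r @ t.T) / temperature

# Toy example: two regions and two category prompts in a 3-dim space.
regions = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.1]])
texts = np.array([[0.9, 0.1, 0.0],   # e.g. prompt for "airplane"
                  [0.0, 1.0, 0.0]])  # e.g. prompt for "ship"
scores = region_text_similarity(regions, texts)
labels = scores.argmax(axis=1)  # region 0 -> category 0, region 1 -> category 1
```

In practice the score matrix would feed a softmax over categories for zero-shot classification of each detected region.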
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing, covering neurocomputing theory, practice, and applications.