Popeye: A Unified Visual-Language Model for Multisource Ship Detection From Remote Sensing Imagery

IF 4.7 · Zone 2, Earth Sciences · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing · Pub Date: 2024-10-30 · DOI: 10.1109/JSTARS.2024.3488034

Wei Zhang;Miaoxin Cai;Tong Zhang;Guoqiang Lei;Yin Zhuang;Xuerui Mao

Volume 17, pages 20050-20063. Full text: https://ieeexplore.ieee.org/document/10738390/

Citations: 0

Abstract

Ship detection aims to locate ships in remote sensing (RS) scenes. Because of differing imaging payloads, the varied appearance of ships, and complicated background interference in the bird's-eye view, it is difficult to set up a unified paradigm for multisource ship detection. To address this challenge, this article leverages the powerful generalization ability of large language models and proposes a unified visual-language model, called Popeye, for multisource ship detection from RS imagery. Specifically, to bridge the interpretation gap across multisource images, a novel unified labeling paradigm is designed to integrate different visual modalities and the two ship detection formats, i.e., horizontal bounding box and oriented bounding box. Subsequently, a hybrid experts encoder is designed to refine multiscale visual features, thereby enhancing visual perception. Then, a visual-language alignment method is developed for Popeye to strengthen joint comprehension of visual and language content. Furthermore, an instruction adaption mechanism is proposed for transferring pretrained visual-language knowledge from the natural-scene domain into the RS domain for multisource ship detection. In addition, the Segment Anything Model is integrated into Popeye to achieve pixel-level ship segmentation without additional training costs. Finally, extensive experiments are conducted on a newly constructed ship instruction dataset named MMShip, and the results indicate that Popeye outperforms current specialist, open-vocabulary, and other visual-language models across various zero-shot multisource ship detection tasks.
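The abstract names two box formats, horizontal and oriented, that the unified labeling paradigm integrates. As a rough illustration of how the two relate geometrically, the sketch below converts an oriented box to its corner points and to the enclosing horizontal box, then serializes each as a text label. The `(cx, cy, w, h, angle)` parameterization, the function names, and the `<hbb>`/`<obb>` tag format are assumptions made for this example only; the abstract does not specify the label format Popeye actually uses.

```python
import math


def obb_to_corners(cx, cy, w, h, angle_deg):
    """Corners of a w-by-h box centered at (cx, cy), rotated by angle_deg about its center."""
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Corner offsets of the unrotated box, relative to the center.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Apply a standard 2D rotation to each offset, then translate to the center.
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]


def corners_to_hbb(corners):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) enclosing the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def to_label(kind, values):
    """Serialize numbers as a tagged text label (hypothetical format, not the paper's)."""
    body = ",".join(f"{v:.1f}" for v in values)
    return f"<{kind}>{body}</{kind}>"


if __name__ == "__main__":
    corners = obb_to_corners(50, 50, 40, 20, 90)
    flat = [coord for point in corners for coord in point]
    print(to_label("obb", flat))
    print(to_label("hbb", corners_to_hbb(corners)))
```

Writing both formats as plain text is one way a language model could emit either kind of box through the same output channel; whether Popeye does this in the same way is not stated in the abstract.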

Source journal

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Earth Sciences: Imaging Science & Photographic Technology)

CiteScore: 9.30
Self-citation rate: 10.90%
Articles published: 563
Review time: 4.7 months

About the journal: The IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing addresses the growing field of applications in Earth observations and remote sensing, and also provides a venue for the rapidly expanding special issues sponsored by the IEEE Geoscience and Remote Sensing Society. The journal draws upon the experience of the highly successful “IEEE Transactions on Geoscience and Remote Sensing” and provides a complementary medium for the wide range of topics in applied Earth observations. The ‘Applications’ areas encompass the societal benefit areas of the Global Earth Observation System of Systems (GEOSS) program. Through deliberations over two years, ministers from 50 countries agreed to identify nine areas where Earth observation could positively impact the quality of life and health of their respective countries. Some of these areas, including biodiversity, health, and climate, are not traditionally addressed in the IEEE context. Yet it is the skill sets of IEEE members, in areas such as observations, communications, computers, signal processing, standards, and ocean engineering, that form the technical underpinnings of GEOSS. Thus, the journal attracts a broad range of interests, serving present members in new ways and extending IEEE's visibility into new areas.