Title: Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning
Authors: Anli Liu; Lingwu Meng; Liang Xiao
DOI: 10.1109/JSTARS.2024.3487846
Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 20026-20040
Published: 2024-10-29 (Journal Article)
Impact factor: 4.7; JCR: Q1 (Engineering, Electrical & Electronic)
Article page: https://ieeexplore.ieee.org/document/10737430/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10737430
Citations: 0
Abstract
Remote sensing image captioning (RSIC) is a crucial task in interpreting remote sensing images (RSIs), as it involves describing their content in clear and precise natural language. However, RSIC encounters difficulties due to the intricate structure and distinctive features of the images, such as rotational ambiguity. Visually similar objects or areas can lead to misidentification. In addition, prioritizing groups of objects with strong relational ties during caption generation poses a significant challenge. To address these challenges, we propose the visual rotated position encoding transformer for RSIC. First, rotation-invariant features and global features are extracted using a multilevel feature extraction (MFE) module. To focus on closely related rotated objects, we design a visual rotated position encoding module, which is incorporated into the transformer encoder to model directional relationships between objects. To distinguish similar features and guide caption generation, we propose a feature enhancement fusion module consisting of feature enhancement and feature fusion. The feature enhancement component adopts a self-attention mechanism to construct fully connected graphs over object features. The feature fusion component integrates global features and word vectors to guide the caption generation process. In addition, we construct an RSI rotated object detection dataset, RSIC-ROD, and pretrain a rotated object detector. The proposed method demonstrates significant performance improvements on four datasets, showcasing enhanced capabilities in preserving descriptive details, distinguishing similar objects, and accurately capturing object relationships.
Journal Introduction
The IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing addresses the growing field of applications in Earth observations and remote sensing, and also provides a venue for the rapidly expanding special issues sponsored by the IEEE Geoscience and Remote Sensing Society. The journal draws upon the experience of the highly successful IEEE Transactions on Geoscience and Remote Sensing and provides a complementary medium for the wide range of topics in applied Earth observations. The "Applications" areas encompass the societal benefit areas of the Global Earth Observation System of Systems (GEOSS) program. Through deliberations over two years, ministers from 50 countries agreed to identify nine areas where Earth observation could positively impact the quality of life and health of their respective countries. Some of these are areas not traditionally addressed in the IEEE context, including biodiversity, health, and climate. Yet it is the skill sets of IEEE members, in areas such as observations, communications, computers, signal processing, standards, and ocean engineering, that form the technical underpinnings of GEOSS. Thus, the journal attracts a broad range of interests that serves present members in new ways and expands IEEE visibility into new areas.