AeroReformer: Aerial Referring Transformer for UAV-based Referring Image Segmentation

Journal impact factor: 8.6 · JCR Q1 (Remote Sensing)

International journal of applied earth observation and geoinformation : ITC journal Pub Date : 2025-09-01 DOI:10.1016/j.jag.2025.104817

Rui Li, Xiaowei Zhao

Volume 143, Article 104817. Full text: https://www.sciencedirect.com/science/article/pii/S1569843225004649

Abstract

As a novel and challenging task, referring segmentation combines computer vision and natural language processing to localise and segment objects based on textual descriptions. While Referring Image Segmentation (RIS) has been extensively studied in natural images, little attention has been given to aerial imagery, particularly from Unmanned Aerial Vehicles (UAVs). The unique challenges of UAV imagery, including complex spatial scales, occlusions, and varying object orientations, render existing RIS approaches ineffective. A key limitation has been the lack of UAV-specific datasets, as manually annotating pixel-level masks and generating textual descriptions is labour-intensive and time-consuming. To address this gap, we design an automatic labelling pipeline that leverages pre-existing UAV segmentation datasets and Multimodal Large Language Models (MLLMs) to generate textual descriptions. Furthermore, we propose the Aerial Referring Transformer (AeroReformer), a novel framework for UAV Referring Image Segmentation (UAV-RIS), featuring a Vision-Language Cross-Attention Module (VLCAM) for effective cross-modal understanding and a Rotation-Aware Multi-Scale Fusion (RAMSF) decoder to enhance segmentation accuracy in aerial scenes. Extensive experiments on two newly developed datasets demonstrate the superiority of AeroReformer over existing methods, establishing a new benchmark for UAV-RIS. The datasets and code are publicly available at https://github.com/lironui/AeroReformer.
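
The abstract names a vision-language cross-attention module but does not describe its internals. As a rough illustration of the general idea only, the sketch below shows generic scaled dot-product cross-attention in which visual tokens act as queries over text tokens. It is not the authors' VLCAM: the function names, the identity projections, the residual connection, and the toy inputs are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual_tokens, text_tokens):
    """Let every visual token attend over all text tokens.

    visual_tokens: list of N vectors used as queries (e.g. image patches).
    text_tokens:   list of T vectors used as keys and values (e.g. words).
    Returns the fused features (N vectors) and the N x T attention weights.
    Learned query/key/value projections are omitted (assumed identity).
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for query in visual_tokens:
        # Similarity between this visual token and each text token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in text_tokens]
        row = softmax(scores)
        # Weighted mix of the text tokens for this visual location.
        mixed = [sum(w * value[d] for w, value in zip(row, text_tokens))
                 for d in range(dim)]
        # Residual connection (an assumption): keep the visual feature
        # and add the language context on top of it.
        fused.append([q + m for q, m in zip(query, mixed)])
        weights.append(row)
    return fused, weights


if __name__ == "__main__":
    # Toy 2-D embeddings: three image patches and two words (hypothetical).
    patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    words = [[4.0, 0.0], [0.0, 4.0]]
    fused, attn = cross_attention(patches, words)
    for row in attn:
        print([round(w, 3) for w in row])
```

In this toy setup each patch puts most of its attention on the word whose embedding points in the same direction, and the third patch, which is equally similar to both words, splits its attention evenly. A practical module would add learned projections, multiple heads, and normalisation.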

Source journal

International journal of applied earth observation and geoinformation : ITC journal
Subject areas: Global and Planetary Change; Management, Monitoring, Policy and Law; Earth-Surface Processes; Computers in Earth Sciences

CiteScore: 12.00
Self-citation rate: 0.00%
Review time: 77 days

Journal description: The International Journal of Applied Earth Observation and Geoinformation publishes original papers that use earth observation data for natural resource and environmental inventory and management. These data primarily originate from remote sensing platforms, including satellites and aircraft, supplemented by surface and subsurface measurements. The journal explores conceptual and data-driven approaches to natural resources such as forests, agricultural land, soils, and water, and to environmental concerns such as biodiversity, land degradation, and hazards. It covers geoinformation themes including data capture, databasing, visualization, interpretation, data quality, and spatial uncertainty.