Transformer-Based Multi-layer Feature Aggregation and Rotated Anchor Matching for Oriented Object Detection in Remote Sensing Images

IF 2.9 Zone 4 (Multidisciplinary Journals) Q1 Multidisciplinary

Arabian Journal for Science and Engineering Pub Date : 2024-03-22 DOI:10.1007/s13369-024-08892-z

Chuan Jin, Anqi Zheng, Zhaoying Wu, Changqing Tong


Citations: 0

Abstract

Object detection has made significant progress in computer vision. However, detecting small, arbitrarily oriented, and densely distributed objects remains challenging, especially in aerial remote sensing images. This paper presents MATDet, an end-to-end, Transformer-based encoder-decoder network designed for oriented object detection. The network employs multi-layer feature aggregation and rotated anchor matching to improve detection accuracy for small, oriented, and densely distributed objects. Specifically, the encoder encodes labeled image blocks using convolutional neural network (CNN) feature maps and fuses these blocks with higher-resolution multi-scale features through cross-layer connections, facilitating the extraction of global contextual information. The decoder then upsamples the encoded features, recovering the full spatial resolution of the feature maps to capture the local–global semantic features needed for accurate object localization. In addition, high-quality anchor box proposals are generated by refined convolution, and the convolved features are adaptively aligned to the anchor boxes to reduce redundant computation. MATDet achieves mAPs of 80.35%, 78.83%, 73.60%, and 98.01% on the DOTAv1.0, DOTAv1.5, DIOR, and HRSC2016 datasets, respectively, outperforming the baseline model for oriented object detection. These results support the feasibility and effectiveness of the proposed methods.
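The abstract does not say how MATDet parameterizes its rotated anchor boxes. As a rough illustration of what an oriented box is, the sketch below assumes the common five-parameter form (center x, center y, width, height, angle in radians) and converts it to four corner points. The function names `rotated_box_corners` and `polygon_area` are hypothetical and are not taken from the paper.

```python
import math


def rotated_box_corners(cx, cy, w, h, theta):
    """Corners of a w-by-h box centred at (cx, cy), rotated by theta radians.

    Assumed convention (not from the paper): theta is measured
    counter-clockwise from the x-axis.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corner offsets in the box's own unrotated frame, in order around the box.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by theta, then translate to the box centre.
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


axis_aligned = rotated_box_corners(10.0, 5.0, 4.0, 2.0, 0.0)
tilted = rotated_box_corners(10.0, 5.0, 4.0, 2.0, math.pi / 6)
print(axis_aligned)
print(polygon_area(tilted))
```

Rotation leaves the box area unchanged (width times height), which gives a quick sanity check on any corner conversion of this kind.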



Source journal

Arabian Journal for Science and Engineering (Multidisciplinary Journals)

CiteScore

5.20

Self-citation rate

3.40%

Articles published

Review time

4.3 months

About the journal: King Fahd University of Petroleum & Minerals (KFUPM) partnered with Springer to publish the Arabian Journal for Science and Engineering (AJSE). AJSE, which has been published by KFUPM since 1975, is a recognized national, regional and international journal that provides a great opportunity for the dissemination of research advances from the Kingdom of Saudi Arabia, MENA and the world.