Unified multimodal fusion transformer for few shot object detection for remote sensing images

IF 14.7; Zone 1, Computer Science; Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Information Fusion, Pub Date: 2024-06-07, DOI: 10.1016/j.inffus.2024.102508

Abdullah Azeem, Zhengzhou Li, Abubakar Siddique, Yuting Zhang, Shangbo Zhou

Information Fusion, Volume 111, Article 102508. https://www.sciencedirect.com/science/article/pii/S1566253524002860


Abstract

Object detection is a fundamental computer vision task with wide applications in remote sensing, but traditional methods rely heavily on large annotated datasets, which are difficult to obtain, especially for novel object classes. Few-shot object detection (FSOD) aims to address this by enabling detectors to learn from very limited labeled data. Recent work fuses multiple modalities, such as image–text pairs, to tackle data scarcity, but requires an external region proposal network (RPN) to align cross-modal pairs, which leads to a bias towards base classes and insufficient cross-modal contextual learning. To address these problems, we propose a unified multi-modal fusion transformer (UMFT), which extracts visual features from ViT and textual encodings from BERT to align multi-modal representations in an end-to-end manner. Specifically, affinity-guided fusion (AFM) captures semantically related image–text pairs by modeling their affinity relationships and selectively combines the most informative pairs. In addition, a cross-modal correlation module (CCM) captures discriminative inter-modal patterns between image and text representations and filters out unrelated features to enhance cross-modal alignment. By leveraging AFM to focus on semantic relationships and CCM to refine inter-modal features, the model better aligns multimodal data without an RPN. These representations are passed to a detection decoder that predicts bounding boxes, class probabilities, and cross-modal attributes. Evaluation of UMFT on the benchmark datasets NWPU VHR-10 and DIOR demonstrated its ability to leverage limited image–text training data via dynamic fusion, achieving new state-of-the-art mean average precision (mAP) for few-shot object detection. Our code will be publicly available at https://github.com/abdullah-azeem/umft.
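The abstract does not give the formulas behind AFM or CCM, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes cosine similarity as the affinity measure, top-k selection as the "selective" step, and a fixed threshold as the filtering step; the function names `affinity_fuse` and `correlation_filter`, and the parameters `top_k` and `threshold`, are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def affinity_fuse(image_feats, text_feats, top_k=1):
    """Assumed stand-in for affinity-guided fusion: each image feature is
    combined with a softmax-weighted mix of its top_k most similar text features."""
    fused = []
    for patch in image_feats:
        affinities = [cosine(patch, tok) for tok in text_feats]
        # Keep only the indices of the top_k highest-affinity text features.
        keep = sorted(range(len(affinities)),
                      key=lambda j: affinities[j], reverse=True)[:top_k]
        weights = softmax([affinities[j] for j in keep])
        context = [sum(w * text_feats[j][d] for w, j in zip(weights, keep))
                   for d in range(len(patch))]
        fused.append([p + c for p, c in zip(patch, context)])
    return fused

def correlation_filter(feats, text_feats, threshold=0.5):
    """Assumed stand-in for correlation-based filtering: zero out any feature
    whose best cosine similarity to the text features is below threshold."""
    out = []
    for f in feats:
        best = max(cosine(f, tok) for tok in text_feats)
        out.append(list(f) if best >= threshold else [0.0] * len(f))
    return out

# Toy 2-D features: two image patches match a text token, one matches nothing.
image_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = affinity_fuse(image_feats, text_feats, top_k=1)
filtered = correlation_filter(fused, text_feats, threshold=0.5)
```

In this toy setup the third patch has no well-matched text token, so the filter suppresses it, while the two matched patches are reinforced by their text counterparts. A real model would operate on learned ViT and BERT embeddings and use trainable attention rather than fixed cosine scores.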


Source journal

Information Fusion (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore

33.20

Self-citation rate

4.30%

Articles published

161

Review time

7.9 months

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.