Detector With Classifier2: An End-to-End Multi-Stream Feature Aggregation Network for Fine-Grained Object Detection in Remote Sensing Images
Shangdong Zheng; Zebin Wu; Yang Xu; Chengxun He; Zhihui Wei
IEEE Transactions on Image Processing, vol. 34, pp. 2707-2720, published 2025-04-30
DOI: 10.1109/TIP.2025.3563708
https://ieeexplore.ieee.org/document/10980167/
Abstract
Fine-grained object detection (FGOD) comprises two primary tasks: object detection and fine-grained classification. In natural scenes, most FGOD methods benefit from higher instance resolution and less environmental variation, attributes more commonly associated with the latter task. In this paper, we propose Detector with Classifier2 (DC2), a unified paradigm that explicitly considers the end-to-end integration of object detection and fine-grained classification rather than prioritizing one aspect. First, our detection sub-network is restricted to determining whether a proposal belongs to a coarse category and does not delve into specific sub-categories. Moreover, to reduce redundant pixel-level computation, we propose an instance-level feature enhancement (IFE) module that models the semantic similarities among proposals, which shows great potential for locating more instances in remote sensing images (RSIs). After obtaining the coarse detection predictions, we construct a classification sub-network on top of the detection branch to determine the specific sub-categories of those predictions. Importantly, the detection network operates on the complete image, while the classification network conducts secondary modeling of the detected regions. These operations can be viewed as extracting global contextual information and local intrinsic cues, respectively, for each instance. We therefore propose a multi-stream feature aggregation (MSFA) module to integrate global-stream semantic information and local-stream discriminative cues. The whole DC2 network is trained end-to-end, which effectively exploits the internal correlation between the detection and fine-grained classification networks. We evaluate DC2 on two benchmark datasets, SAT-MTB and HRSC2016. Our method achieves new state-of-the-art results compared with recent works (approximately 7% mAP gain on SAT-MTB) and improves on the baseline by a significant margin (43.2% vs. 36.7%) without any complicated post-processing strategies. Source code for the proposed method is available at https://github.com/zhengshangdong/DC2
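
The data flow described in the abstract (coarse detection on the whole image, enhancement among proposals, then fine-grained classification on fused global and local features) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the parameter-free attention used for the enhancement step, the concatenate-and-project fusion, and the 0.5 score threshold are all assumptions chosen for illustration. The abstract does not specify the internals of the IFE or MSFA modules.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def enhance_instances(proposal_feats):
    """Hypothetical stand-in for an instance-level enhancement step.

    Each proposal feature is refined with a similarity-weighted average
    of all proposal features (scaled dot-product attention without
    learned weights), added back as a residual.
    """
    _, dim = proposal_feats.shape
    similarity = proposal_feats @ proposal_feats.T / np.sqrt(dim)
    attention = softmax(similarity, axis=-1)
    return proposal_feats + attention @ proposal_feats


def aggregate_streams(global_feats, local_feats, fuse_w):
    """Hypothetical fusion: concatenate both streams, then project."""
    fused = np.concatenate([global_feats, local_feats], axis=-1)
    return fused @ fuse_w


def detect_then_classify(global_feats, local_feats,
                         coarse_w, fuse_w, fine_w, score_thresh=0.5):
    """Coarse stage keeps likely objects; fine stage assigns sub-categories.

    global_feats: (n, d) proposal features from the full-image detector.
    local_feats:  (n, d) features re-extracted from the detected regions.
    Returns a boolean keep mask (n,) and sub-category probabilities
    for the kept proposals, shape (kept, num_fine_classes).
    """
    enhanced = enhance_instances(global_feats)
    # Coarse stage: a single "is this the coarse category?" score per proposal.
    coarse_prob = 1.0 / (1.0 + np.exp(-(enhanced @ coarse_w)))
    keep = coarse_prob >= score_thresh
    # Fine stage: classify only the kept proposals on the fused features.
    fused = aggregate_streams(enhanced[keep], local_feats[keep], fuse_w)
    fine_prob = softmax(fused @ fine_w, axis=-1)
    return keep, fine_prob
```

In this sketch the coarse stage acts only as a filter, so the fine-grained classifier never sees proposals that were rejected as background. In the paper the two stages are trained jointly, which the sketch does not model.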