Toward Realistic Hierarchical Object Detection: Problem, Benchmark, and Solution

Impact factor: 11.1 · Zone 1 (Engineering and Technology) · JCR Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)

IEEE Transactions on Circuits and Systems for Video Technology · Pub Date: 2025-03-18 · DOI: 10.1109/TCSVT.2025.3552596

Juexiao Feng;Yuhong Yang;Mengyao Lyu;Tianxiang Hao;Yi-Jie Huang;Yanchun Xie;Yaqian Li;Jungong Han;Liuyu Xiang;Guiguang Ding

Volume 35, Issue 9, pp. 9351-9364 · https://ieeexplore.ieee.org/document/10930933/

Citations: 0

Abstract

With the continuous advancement of deep learning, object detection has made remarkable progress in accurately identifying a wide range of object categories, even within increasingly complex scenes. However, as the number of categories grows, visual concepts naturally organize into a label hierarchy. We contend that existing hierarchical classification and detection methods predominantly prioritize fine-grained prediction, potentially leading to inconsistencies with realistic human perception. From this perspective, we investigate the Hierarchical Object Detection (HOD) problem to better align with real-world perception. To address the lack of benchmarks in the field, we build a large-scale HOD benchmark termed RHOD with open-source datasets, comprising 740 categories. To better align the hierarchical object detectors towards realistic perception, we propose a new evaluation metric named Hierarchical Average Precision (HAP). Furthermore, we present a novel hierarchical object detection method that includes two components, Tree Soft Labeling (TSL) and Hierarchical Extension and Suppression (HES). Our method mitigates the issue of overconfidence in fine-grained predictions, which has been prevalent in previous approaches. We evaluate a range of existing methods on the RHOD benchmark, including plain, hierarchical, and open-vocabulary models. Additionally, we perform comprehensive experiments to assess the performance of our proposed method. The experimental results show that our method achieves state-of-the-art performance on the RHOD benchmark.

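Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering and Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months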

Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.