Title: Toward Realistic Hierarchical Object Detection: Problem, Benchmark, and Solution
Authors: Juexiao Feng; Yuhong Yang; Mengyao Lyu; Tianxiang Hao; Yi-Jie Huang; Yanchun Xie; Yaqian Li; Jungong Han; Liuyu Xiang; Guiguang Ding
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9351-9364
DOI: 10.1109/TCSVT.2025.3552596
Publication date: 2025-03-18
Publication type: Journal Article
URL: https://ieeexplore.ieee.org/document/10930933/
Open access: no
Impact factor: 11.1
JCR: Q1 (Engineering, Electrical & Electronic)
Region category: Engineering and Technology (region 1)
Citation count: 0
Abstract
With the continuous advancement of deep learning, object detection has made remarkable progress in accurately identifying a wide range of object categories, even within increasingly complex scenes. However, as the number of categories grows, visual concepts naturally organize into a label hierarchy. We contend that existing hierarchical classification and detection methods predominantly prioritize fine-grained prediction, potentially leading to inconsistencies with realistic human perception. From this perspective, we investigate the Hierarchical Object Detection (HOD) problem to better align with real-world perception. To address the lack of benchmarks in the field, we build a large-scale HOD benchmark termed RHOD with open-source datasets, comprising 740 categories. To better align the hierarchical object detectors towards realistic perception, we propose a new evaluation metric named Hierarchical Average Precision (HAP). Furthermore, we present a novel hierarchical object detection method that includes two components, Tree Soft Labeling (TSL) and Hierarchical Extension and Suppression (HES). Our method mitigates the issue of overconfidence in fine-grained predictions, which has been prevalent in previous approaches. We evaluate a range of existing methods on the RHOD benchmark, including plain, hierarchical, and open-vocabulary models. Additionally, we perform comprehensive experiments to assess the performance of our proposed method. The experimental results show that our method achieves state-of-the-art performance on the RHOD benchmark.
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.