DinoQuery: Promoting Small 3D Object Detection With Textual Prompt
Authors: Tong Ning; Ke Lu; Xirui Jiang; Hongjuan Pei; Jian Xue
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8639-8652
DOI: 10.1109/TCSVT.2025.3557950
Publication date: 2025-04-04
Publication type: Journal Article
URL: https://ieeexplore.ieee.org/document/10949216/
Impact factor: 11.1
JCR: Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)
Region: 1 (Engineering and Technology)
Open access: No
Citations: 0
Abstract
Query-based 3D object detection has been highly successful in autonomous driving because it achieves good performance at low computational cost. However, it still struggles to reliably detect small objects such as bicycles and pedestrians. To address this challenge, this paper introduces a novel sparse query-based approach, termed DinoQuery. The approach uses Grounding-DINO with textual prompts to select small objects and generate 2D category-aware queries. These 2D category-aware queries, combined with 2D global queries, are then lifted to 3D queries by associating each sampled query with its respective 3D position, orientation, and size. The validity of these 3D queries, along with the 2D queries, is verified by the Comprehensive Contrastive Learning (CCL) mechanism: all 2D and 3D queries are aligned with their respective 2D and 3D ground-truth labels, and similarity is computed to select true-positive and false-positive queries. A contrastive loss is then introduced to enhance true-positive queries and weaken false-positive ones based on geometric and semantic similarity. DinoQuery was evaluated on the nuScenes dataset and demonstrated excellent performance. Notably, the largest improvements achieved by our method are 3.2% on NDS and 3.1% on mAP.
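The abstract does not say how CCL computes similarity or weights queries. As a rough illustration only, the sketch below assumes cosine similarity between query and ground-truth embeddings, a fixed threshold for labelling queries as true or false positives, and an InfoNCE-style loss. The function names, the threshold, and the temperature are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def split_queries(queries, gts, threshold=0.5):
    """Assumed selection rule: a query is a true positive if its best
    similarity to any ground-truth embedding reaches the threshold."""
    tp, fp = [], []
    for qi, q in enumerate(queries):
        sims = [cosine(q, g) for g in gts]
        gi = max(range(len(gts)), key=sims.__getitem__)  # best-matching label
        if sims[gi] >= threshold:
            tp.append((qi, gi))
        else:
            fp.append(qi)
    return tp, fp

def contrastive_loss(queries, gts, tp, fp, temperature=0.1):
    """Assumed InfoNCE-style loss: each true positive is pulled toward its
    matched label, while false positives act as negatives for that label."""
    if not tp:
        return 0.0
    total = 0.0
    for qi, gi in tp:
        pos = cosine(queries[qi], gts[gi]) / temperature
        negs = [cosine(queries[n], gts[gi]) / temperature for n in fp]
        logits = [pos] + negs
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - pos
    return total / len(tp)

# Toy data: one ground-truth embedding and three candidate queries.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
gts = [[1.0, 0.0]]
tp, fp = split_queries(queries, gts)
loss = contrastive_loss(queries, gts, tp, fp)
```

In this toy setup the loss is small when false positives are already dissimilar from the ground truth and grows as a false positive moves closer to it, which is the behaviour a loss of this kind is meant to penalize.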
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.