DinoQuery: Promoting Small 3D Object Detection With Textual Prompt

IF 11.1 · CAS Region 1 (Engineering & Technology) · JCR Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Circuits and Systems for Video Technology · Pub Date: 2025-04-04 · DOI: 10.1109/TCSVT.2025.3557950

Tong Ning; Ke Lu; Xirui Jiang; Hongjuan Pei; Jian Xue

Journal Article. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8639-8652. https://ieeexplore.ieee.org/document/10949216/

Citations: 0

Abstract

Query-based 3D object detection has achieved significant success in autonomous driving because it delivers good performance at low computational cost. However, it still struggles to reliably detect small objects such as bicycles and pedestrians. To address this challenge, this paper introduces a novel sparse query-based approach, termed DinoQuery. The approach uses Grounding-DINO with textual prompts to select small objects and generate 2D category-aware queries. These 2D category-aware queries, combined with 2D global queries, are then lifted to 3D queries by associating each sampled query with its respective 3D position, orientation, and size. The validity of these 3D queries, along with the 2D queries, is verified by the Comprehensive Contrastive Learning (CCL) mechanism: all 2D and 3D queries are aligned with their respective 2D and 3D ground-truth labels, and similarity is computed to select true-positive and false-positive queries. A contrastive loss is then introduced to enhance true-positive queries and weaken false-positive ones based on geometric and semantic similarity. DinoQuery was tested on the nuScenes dataset and demonstrated excellent performance. Notably, the largest improvement of our method is 3.2% on NDS and 3.1% on mAP.
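The abstract does not give the CCL formulation, so the following is only a minimal sketch of the general idea it describes: split queries into true and false positives by similarity to ground truth, then apply a contrastive loss that favors the former. Everything here is an assumption made for illustration: the function names, the use of plain cosine similarity over embedding vectors as a stand-in for the paper's geometric and semantic similarity, the `threshold` and `tau` values, and the InfoNCE-style loss itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def split_queries(queries, gts, threshold=0.5):
    """Hypothetical selection step: a query is a true positive if its best
    similarity to any ground-truth embedding reaches `threshold`.
    Returns (true_pos, false_pos); true_pos holds (query_idx, gt_idx) pairs."""
    true_pos, false_pos = [], []
    for qi, q in enumerate(queries):
        if not gts:
            false_pos.append(qi)
            continue
        sims = [cosine(q, g) for g in gts]
        best = max(range(len(gts)), key=lambda gi: sims[gi])
        if sims[best] >= threshold:
            true_pos.append((qi, best))
        else:
            false_pos.append(qi)
    return true_pos, false_pos


def contrastive_loss(queries, gts, true_pos, false_pos, tau=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's loss):
    each true-positive query is pulled toward its matched ground truth,
    with the false-positive queries acting as negatives for that target."""
    if not true_pos:
        return 0.0
    total = 0.0
    for qi, gi in true_pos:
        pos = math.exp(cosine(queries[qi], gts[gi]) / tau)
        neg = sum(math.exp(cosine(queries[fi], gts[gi]) / tau) for fi in false_pos)
        total += -math.log(pos / (pos + neg))
    return total / len(true_pos)


if __name__ == "__main__":
    # Toy 2D embeddings: one query aligned with the target, two that are not.
    queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    gts = [[1.0, 0.0]]
    tp, fp = split_queries(queries, gts)
    print(tp, fp, contrastive_loss(queries, gts, tp, fp))
```

In this toy setup the loss shrinks as true-positive queries move closer to their targets and as false-positive queries move away from them, which is the qualitative behavior the abstract attributes to the contrastive term.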


Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

13.80

Self-citation rate

27.40%

Articles published

660

Review time

5 months

Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) covers all aspects of video technologies from a circuits and systems perspective. It encourages submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. It also welcomes contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Papers focusing on hardware and software design and implementation are highly valued.