BEVUDA++: Geometric-Aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection

IF 8.3 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2024-12-25 | DOI: 10.1109/TCSVT.2024.3523049

Rongyu Zhang;Jiaming Liu;Xiaoqi Li;Xiaowei Chi;Dan Wang;Li Du;Yuan Du;Shanghang Zhang

Volume 35, Issue 5, pages 5109-5122 | Journal Article | https://ieeexplore.ieee.org/document/10816404/

Citations: 0

Abstract

Vision-centric Bird’s Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9% NDS and 9.5% mAP enhancement on Day-Night adaptation.
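The abstract names an Uncertainty-guided Exponential Moving Average (UEMA) but does not state its update rule. The following is a minimal sketch, not the authors' formulation: `ema_update` is the standard EMA teacher update used in teacher-student training, and `uncertainty_guided_momentum` is a hypothetical rule (an assumption made here for illustration) in which a higher uncertainty score pushes the momentum toward 1, so the teacher absorbs less from an unreliable student step.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], momentum: float) -> List[float]:
    """Standard EMA update: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def uncertainty_guided_momentum(base_momentum: float, uncertainty: float) -> float:
    """Hypothetical rule (assumption, not taken from the paper).

    Interpolates between base_momentum (uncertainty = 0) and 1.0
    (uncertainty = 1), so uncertain student updates move the teacher less.
    """
    u = min(max(uncertainty, 0.0), 1.0)  # clamp the score to [0, 1]
    return base_momentum + (1.0 - base_momentum) * u


# Toy weights standing in for model parameters.
teacher_weights = [1.0, 2.0]
student_weights = [3.0, 4.0]

confident_m = uncertainty_guided_momentum(0.5, 0.0)   # stays at the base momentum
uncertain_m = uncertainty_guided_momentum(0.5, 1.0)   # teacher is frozen

updated_confident = ema_update(teacher_weights, student_weights, confident_m)
updated_uncertain = ema_update(teacher_weights, student_weights, uncertain_m)
```

In this sketch a fully uncertain step leaves the teacher unchanged, while a fully confident step falls back to the plain EMA update. The actual UEMA in the paper may weight uncertainty differently, for example per parameter or per feature space.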




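Source journal: IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 27.40%

Articles published: 660

Review time: 5 months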

About the journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.