DeepInteraction++: Multi-Modality Interaction for Autonomous Driving

Impact Factor: 18.6
Zeyu Yang, Nan Song, Wei Li, Xiatian Zhu, Li Zhang, Philip H.S. Torr
{"title":"DeepInteraction++: Multi-Modality Interaction for Autonomous Driving","authors":"Zeyu Yang;Nan Song;Wei Li;Xiatian Zhu;Li Zhang;Philip H.S. Torr","doi":"10.1109/TPAMI.2025.3565194","DOIUrl":null,"url":null,"abstract":"Existing top-performance autonomous driving systems typically rely on the <italic>multi-modal fusion</i> strategy for reliable scene understanding. This design is however fundamentally restricted due to overlooking the modality-specific strengths and finally hampering the model performance. To address this limitation, in this work, we introduce a novel <italic>modality interaction</i> strategy that allows individual per-modality representations to be learned and maintained throughout, enabling their unique characteristics to be exploited during the whole perception pipeline. To demonstrate the effectiveness of the proposed strategy, we design <italic>DeepInteraction++</i>, a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Specifically, the encoder is implemented as a dual-stream Transformer with specialized attention operation for information exchange and integration between separate modality-specific representations. Our multi-modal representational learning incorporates both object-centric, precise sampling-based feature alignment and global dense information spreading, essential for the more challenging planning task. The decoder is designed to iteratively refine the predictions by alternately aggregating information from separate representations in a unified modality-agnostic manner, realizing multi-modal predictive interaction. Extensive experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6749-6763"},"PeriodicalIF":18.6000,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10980037/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Existing top-performing autonomous driving systems typically rely on a multi-modal fusion strategy for reliable scene understanding. This design, however, is fundamentally restricted because it overlooks modality-specific strengths, ultimately hampering model performance. To address this limitation, we introduce a novel modality interaction strategy that learns and maintains individual per-modality representations throughout, so that their unique characteristics can be exploited across the whole perception pipeline. To demonstrate the effectiveness of the proposed strategy, we design DeepInteraction++, a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Specifically, the encoder is implemented as a dual-stream Transformer with specialized attention operations for information exchange and integration between the separate modality-specific representations. Our multi-modal representational learning incorporates both object-centric, precise sampling-based feature alignment and global dense information spreading, which is essential for the more challenging planning task. The decoder iteratively refines predictions by alternately aggregating information from the separate representations in a unified, modality-agnostic manner, realizing multi-modal predictive interaction. Extensive experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
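
To make the encoder's modality-interaction idea concrete, below is a minimal, hypothetical PyTorch sketch of a dual-stream interaction layer in the spirit described above: two modality-specific token streams (e.g., LiDAR and camera features) each refine themselves with self-attention, then exchange information via cross-attention, and both representations are returned separately rather than fused into one. This is not the authors' implementation; the module name, dimensions, and token shapes (DualStreamInteractionLayer, d_model=256, n_heads=8) are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's code): a dual-stream layer that keeps
# two modality-specific representations and lets them exchange information.
import torch
import torch.nn as nn


class DualStreamInteractionLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Intra-modal self-attention refines each stream on its own.
        self.self_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modal attention exchanges information between the two streams.
        self.cross_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_b = nn.LayerNorm(d_model)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor):
        # feats_a, feats_b: (batch, tokens, d_model) per-modality token features.
        a = feats_a + self.self_attn_a(feats_a, feats_a, feats_a)[0]
        b = feats_b + self.self_attn_b(feats_b, feats_b, feats_b)[0]
        # Each stream queries the other; the two outputs stay separate.
        a = self.norm_a(a + self.cross_attn_a(a, b, b)[0])
        b = self.norm_b(b + self.cross_attn_b(b, a, a)[0])
        return a, b  # two refined, still modality-specific representations


if __name__ == "__main__":
    layer = DualStreamInteractionLayer()
    lidar_tokens = torch.randn(2, 100, 256)   # e.g., flattened LiDAR BEV tokens
    camera_tokens = torch.randn(2, 120, 256)  # e.g., flattened image tokens
    out_a, out_b = layer(lidar_tokens, camera_tokens)
    print(out_a.shape, out_b.shape)  # (2, 100, 256) and (2, 120, 256)
```

Returning two outputs, rather than a single fused tensor, mirrors the central idea of maintaining per-modality representations throughout the pipeline; the actual encoder additionally uses sampling-based feature alignment and dense information spreading, which this sketch omits.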