GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

Impact Factor 12.5 · Q1 (Transportation)
Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, Chengzhong Xu
{"title":"GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models","authors":"Haicheng Liao ,&nbsp;Huanming Shen ,&nbsp;Zhenning Li ,&nbsp;Chengyue Wang ,&nbsp;Guofa Li ,&nbsp;Yiming Bie ,&nbsp;Chengzhong Xu","doi":"10.1016/j.commtr.2023.100116","DOIUrl":null,"url":null,"abstract":"<div><p>In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders—Text, Emotion, Image, Context, and Cross-Modal—with a multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments.</p></div>","PeriodicalId":100292,"journal":{"name":"Communications in Transportation Research","volume":null,"pages":null},"PeriodicalIF":12.5000,"publicationDate":"2024-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772424723000276/pdfft?md5=a1dbc3e25c6818f6e3d6006770262c6e&pid=1-s2.0-S2772424723000276-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Communications in Transportation Research","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772424723000276","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"TRANSPORTATION","Score":null,"Total":0}
Citations: 0

Abstract

In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders—Text, Emotion, Image, Context, and Cross-Modal—with a multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments.
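The abstract states that CAVG's capture of contextual semantics and human emotional features is augmented by LLMs including GPT-4, but gives no implementation detail here. Below is a minimal sketch of what such an augmentation step could look like, assuming the OpenAI Python client; the prompt wording, the `extract_command_context` helper, and the one-sentence output format are hypothetical and not taken from the paper.

```python
# Illustrative only: a hypothetical helper that asks GPT-4 to summarize the
# intent, target object, and emotional tone of an in-vehicle command. The prompt
# and output format are assumptions, not the authors' implementation.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def extract_command_context(command: str) -> str:
    """Return a one-sentence summary of intent, target, and emotional tone."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": ("You analyze commands given to an autonomous vehicle. "
                         "Reply with one sentence naming the commander's intent, "
                         "the target object, and the emotional tone.")},
            {"role": "user", "content": command},
        ],
    )
    return response.choices[0].message.content

print(extract_command_context("Hurry, stop next to the man in the red jacket!"))
```

In a pipeline of the kind described, the returned summary (or an embedding of it) could feed the Context and Emotion encoders alongside the raw command.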
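The multi-head cross-modal attention that links a command to candidate image regions can be sketched with standard PyTorch modules. The snippet below is a simplified illustration under our own assumptions: the `CrossModalGrounder` class, the feature dimensions, and the linear scoring head are hypothetical, and it does not reproduce the five-encoder design or the Region-Specific Dynamic layer.

```python
# Minimal cross-modal attention sketch (not the CAVG implementation):
# candidate image regions attend over the encoded command tokens, and each
# attended region is scored so the most command-relevant region can be selected.
import torch
import torch.nn as nn

class CrossModalGrounder(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # relevance score per region

    def forward(self, text_feats: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:   (batch, num_tokens, dim)  encoded command
        # region_feats: (batch, num_regions, dim) encoded candidate regions
        attended, _ = self.attn(query=region_feats, key=text_feats, value=text_feats)
        logits = self.score(attended).squeeze(-1)  # (batch, num_regions)
        return logits.softmax(dim=-1)              # probability over regions

# Toy usage with random tensors standing in for real encoder outputs.
model = CrossModalGrounder()
text = torch.randn(2, 20, 256)     # 2 commands, 20 tokens each
regions = torch.randn(2, 32, 256)  # 32 candidate regions per image
probs = model(text, regions)       # shape (2, 32); argmax = grounded region
```

Using the regions as queries and the command tokens as keys/values is one common convention; the paper's actual attention direction, number of heads, and RSD-based modulation may differ.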

Source journal metrics: CiteScore 15.20 · Self-citation rate 0.00%