Multimodal understanding with GPT-4o to enhance generalizable pedestrian behavior prediction

Impact Factor: 4.9 · CAS Tier 3 (Computer Science) · JCR Q1 (Computer Science, Hardware & Architecture)
Je-Seok Ham, Jia Huang, Peng Jiang, Jinyoung Moon, Yongjin Kwon, Srikanth Saripalli, Changick Kim
DOI: 10.1016/j.compeleceng.2025.110741
Journal: Computers & Electrical Engineering, Volume 129, Article 110741
Published: 2025-10-18 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0045790625006846
Citations: 0

Abstract

Pedestrian behavior prediction is one of the most critical tasks in urban driving scenarios, playing a key role in ensuring road safety. Traditional learning-based methods have relied on vision models for pedestrian behavior prediction. However, fully understanding pedestrians’ behaviors in advance is very challenging due to the complex driving environments and the multifaceted interactions between pedestrians and road elements. Additionally, these methods often show a limited understanding of driving environments not included in the training. The emergence of Multimodal Large Language Models (MLLMs) provides an innovative approach to addressing these challenges through advanced reasoning capabilities. This paper presents OmniPredict, the first study to apply GPT-4o(mni), a state-of-the-art MLLM, for pedestrian behavior prediction in urban driving scenarios. We assessed the model using the JAAD and WiDEVIEW datasets, which are widely used for pedestrian behavior analysis. Our method utilized multiple contextual modalities and achieved 67% accuracy in a zero-shot setting without any task-specific training, surpassing the performance of the latest MLLM baselines by 10%. Furthermore, when incorporating additional contextual information, the experimental results demonstrated a significant increase in prediction accuracy across four behavior types (crossing, occlusion, action, and look). We also validated the model's generalization ability by comparing its responses across various road environment scenarios. OmniPredict exhibits strong generalization capabilities, demonstrating robust decision-making in diverse, unseen, and rare driving scenarios. These findings highlight the potential of MLLMs to enhance pedestrian behavior prediction, paving the way for safer and more informed decision-making in road environments.
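The zero-shot setup the abstract describes — supplying GPT-4o with a camera frame plus contextual cues and asking it to classify pedestrian behavior — can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual prompt or pipeline: the prompt wording, the choice of context fields, and the JPEG frame are all assumptions; only the four behavior labels come from the abstract.

```python
import base64
import json

# The four behavior types evaluated in the paper (crossing, occlusion, action, look).
BEHAVIORS = ["crossing", "occlusion", "action", "look"]

def build_zero_shot_request(image_bytes: bytes, context: dict) -> dict:
    """Assemble a GPT-4o chat-completion payload for pedestrian behavior prediction.

    `context` carries extra contextual cues (e.g. ego-vehicle speed, pedestrian
    bounding box); which cues the paper actually feeds the model is not specified
    here, so these fields are illustrative.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    prompt = (
        "You are an autonomous-driving assistant. Given the camera frame and the "
        f"context {json.dumps(context)}, predict the pedestrian's behavior for "
        f"each of the categories {BEHAVIORS}. Answer with a JSON object mapping "
        "each category to your prediction."
    )
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    }

# The payload would then be sent via the OpenAI SDK, e.g.:
#   client.chat.completions.create(**build_zero_shot_request(frame_bytes, ctx))
request = build_zero_shot_request(b"\xff\xd8placeholder-jpeg", {"ego_speed_kmh": 30})
```

Because no task-specific training is involved, all of the method's leverage sits in this single multimodal request; adding contextual fields to `context` corresponds to the "additional contextual information" that the abstract reports improves accuracy.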
Source Journal
Computers & Electrical Engineering
Category: Engineering & Technology — Engineering: Electrical & Electronic
CiteScore: 9.20
Self-citation rate: 7.00%
Annual publications: 661
Review time: 47 days
Journal description: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.