Dongjie Liu , Dawei Li , Hongliang Ding , Yang Cao , Kun Gao
{"title":"Beyond vision: A unified transformer with bidirectional attention for predicting driver perceived risk from multi-modal data","authors":"Dongjie Liu , Dawei Li , Hongliang Ding , Yang Cao , Kun Gao","doi":"10.1016/j.trc.2025.105270","DOIUrl":null,"url":null,"abstract":"<div><div>Modeling driver perceived risk (or subjective risk) plays a critical role in improving driving safety, as different drivers often perceive varying levels of risk under identical conditions, prompting adjustments in their driving behavior. Driving is a complex activity involving multiple cognitive and perceptual processes, such as visual information, driver feedback, vehicle dynamics, and traffic and environmental conditions. However, existing models for subjective risk perception have yet to fully address the need for integrating multi-modal data. To address this gap, we present a Transformer-based model aimed at processing multimodal inputs in a unified manner to enhance the prediction of subjective risk perception. Unlike existing methodologies that extract features specific to each modality, it employs embedding layers to transform images, unstructured, and structured fields into visual and text tokens. Subsequently, bi-directional multimodal attention blocks with inter-modal and intra-modal attention mechanisms capture comprehensive representations of traffic scene images, unstructured traffic scene descriptions, structured traffic data, environmental statistics, and demographics. Experimental results show that the proposed unified model achieves superior predictive performance over existing benchmarks while maintaining reasonable interpretability. 
Furthermore, the model is generalizable, making it applicable to various multi-modal prediction tasks across different transportation contexts.</div></div>","PeriodicalId":54417,"journal":{"name":"Transportation Research Part C-Emerging Technologies","volume":"179 ","pages":"Article 105270"},"PeriodicalIF":7.6000,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transportation Research Part C-Emerging Technologies","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0968090X25002748","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"TRANSPORTATION SCIENCE & TECHNOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
Modeling driver perceived risk (or subjective risk) plays a critical role in improving driving safety, as different drivers often perceive varying levels of risk under identical conditions, prompting adjustments in their driving behavior. Driving is a complex activity involving multiple cognitive and perceptual processes drawing on visual information, driver feedback, vehicle dynamics, and traffic and environmental conditions. However, existing models of subjective risk perception have yet to fully address the need to integrate multi-modal data. To address this gap, we present a Transformer-based model that processes multimodal inputs in a unified manner to enhance the prediction of subjective risk perception. Unlike existing methodologies that extract features specific to each modality, it employs embedding layers to transform images, unstructured text, and structured fields into visual and text tokens. Subsequently, bidirectional multimodal attention blocks with inter-modal and intra-modal attention mechanisms capture comprehensive representations of traffic scene images, unstructured traffic scene descriptions, structured traffic data, environmental statistics, and demographics. Experimental results show that the proposed unified model achieves superior predictive performance over existing benchmarks while maintaining reasonable interpretability. Furthermore, the model is generalizable, making it applicable to various multi-modal prediction tasks across different transportation contexts.
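The bidirectional attention idea from the abstract can be illustrated in a minimal sketch: each modality first attends to itself (intra-modal), then queries the other modality (inter-modal) in both directions. This is an illustrative toy in NumPy, not the paper's actual architecture; the function names, single-head attention, token counts, and residual fusion scheme are all assumptions.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention: q is (n_q, d), k and v are (n_k, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def bidirectional_block(vis, txt):
    """One toy bidirectional multimodal block: intra-modal self-attention
    followed by inter-modal cross-attention in both directions."""
    vis = vis + attention(vis, vis, vis)   # visual tokens attend to visual tokens
    txt = txt + attention(txt, txt, txt)   # text tokens attend to text tokens
    vis2 = vis + attention(vis, txt, txt)  # visual queries text (inter-modal)
    txt2 = txt + attention(txt, vis, vis)  # text queries visual (inter-modal)
    return vis2, txt2

rng = np.random.default_rng(0)
v = rng.normal(size=(16, 64))  # e.g. patch tokens from a traffic-scene image
t = rng.normal(size=(10, 64))  # e.g. tokens from scene descriptions and structured fields
v_out, t_out = bidirectional_block(v, t)
print(v_out.shape, t_out.shape)  # (16, 64) (10, 64)
```

Because both token streams pass through the same block and keep their shapes, blocks like this can be stacked, letting visual and textual representations refine each other layer by layer, which is the unified treatment of modalities the abstract emphasizes.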
About the journal:
Transportation Research: Part C (TR_C) is dedicated to showcasing high-quality, scholarly research that delves into the development, applications, and implications of transportation systems and emerging technologies. Our focus lies not solely on individual technologies, but rather on their broader implications for the planning, design, operation, control, maintenance, and rehabilitation of transportation systems, services, and components. In essence, the intellectual core of the journal revolves around the transportation aspect rather than the technology itself. We actively encourage the integration of quantitative methods from diverse fields such as operations research, control systems, complex networks, computer science, and artificial intelligence. Join us in exploring the intersection of transportation systems and emerging technologies to drive innovation and progress in the field.