A Multi-Modal Architecture With Spatio-Temporal-Text Adaptation for Video-Based Traffic Accident Anticipation

IF 11.1 | JCR Q1 | Region 1 (Engineering & Technology) | ENGINEERING, ELECTRICAL & ELECTRONIC
Patrik Patera;Yie-Tarng Chen;Wen-Hsien Fang
DOI: 10.1109/TCSVT.2025.3552895
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8989-9002
Publication date: 2025-03-19 (Journal Article)
Citations: 0

Abstract

Early and precise accident anticipation is critical for preventing road traffic incidents in advanced traffic systems. This paper presents a Multi-modal Architecture with Spatio-Temporal-Text Adaptation (MASTTA), featuring a Visual Encoder and a Text Encoder within a streamlined end-to-end framework for traffic accident anticipation. Both encoders leverage the CLIP model, pre-trained on large-scale text-image pairs, to utilize visual and textual information effectively. MASTTA captures complex traffic patterns and relationships by fine-tuning only the adapters, reducing retraining demands. In the Visual Encoder, spatio-temporal adaptation is achieved through a novel Temporal Adapter, a novel Spatial Adapter, and an MLP Adapter. The Temporal Adapter enhances temporal consistency in accident-prone areas, while the Spatial Adapter captures spatio-temporal interactions among visual cues. The Text Encoder, equipped with a Text Adapter and an MLP Adapter, aligns latent textual and visual features in a joint embedding space, refining semantic representation. This synergy of text and visual adapters enables MASTTA to model complex spatial interactions across long-range temporal context, improving accident anticipation. We validate MASTTA on DAD and CCD datasets, demonstrating significant improvements in both the earliness and correctness compared to state-of-the-art methods.
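The abstract's central efficiency claim is that MASTTA fine-tunes only small adapter modules while the pre-trained CLIP backbone stays frozen. The paper's actual Temporal, Spatial, MLP, and Text Adapters are defined in the full text; purely as a hypothetical illustration of the general adapter idea, the plain-Python sketch below shows a residual bottleneck adapter (down-project, nonlinearity, up-project, residual add). The zero-initialized up-projection means the adapter starts as an identity mapping, so inserting it does not disturb the frozen model's features before training.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gelu(vec):
    """Approximate GELU activation, applied elementwise."""
    return [0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
            * (x + 0.044715 * x ** 3))) for x in vec]

class BottleneckAdapter:
    """Generic residual bottleneck adapter (illustrative, not MASTTA's design).

    Down-projects a frozen feature of size `dim` to `bottleneck`,
    applies GELU, up-projects back, and adds the result residually.
    Only these small matrices would be trained; the backbone is frozen.
    """
    def __init__(self, dim, bottleneck, scale=0.5, seed=0):
        rng = random.Random(seed)
        # Small random down-projection (bottleneck x dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Zero-init up-projection (dim x bottleneck): adapter starts as identity.
        self.up = [[0.0] * bottleneck for _ in range(dim)]
        self.scale = scale

    def __call__(self, x):
        h = gelu(matvec(self.down, x))        # compress to bottleneck
        delta = matvec(self.up, h)            # expand back to dim
        return [xi + self.scale * di for xi, di in zip(x, delta)]
```

Because the up-projection is zero-initialized, `adapter(x) == x` at initialization; only once the up-projection receives nonzero weights does the adapter modify the frozen features.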
Source Journal
CiteScore: 13.80
Self-citation rate: 27.40%
Articles per year: 660
Review time: 5 months
Journal introduction: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.