{"title":"基于视频的交通事故预测的时空文本自适应多模态体系结构","authors":"Patrik Patera;Yie-Tarng Chen;Wen-Hsien Fang","doi":"10.1109/TCSVT.2025.3552895","DOIUrl":null,"url":null,"abstract":"Early and precise accident anticipation is critical for preventing road traffic incidents in advanced traffic systems. This paper presents a Multi-modal Architecture with Spatio-Temporal-Text Adaptation (MASTTA), featuring a Visual Encoder and a Text Encoder within a streamlined end-to-end framework for traffic accident anticipation. Both encoders leverage the CLIP model, pre-trained on large-scale text-image pairs, to utilize visual and textual information effectively. MASTTA captures complex traffic patterns and relationships by fine-tuning only the adapters, reducing retraining demands. In the Visual Encoder, spatio-temporal adaptation is achieved through a novel Temporal Adapter, a novel Spatial Adapter, and an MLP Adapter. The Temporal Adapter enhances temporal consistency in accident-prone areas, while the Spatial Adapter captures spatio-temporal interactions among visual cues. The Text Encoder, equipped with a Text Adapter and an MLP Adapter, aligns latent textual and visual features in a joint embedding space, refining semantic representation. This synergy of text and visual adapters enables MASTTA to model complex spatial interactions across long-range temporal context, improving accident anticipation. We validate MASTTA on DAD and CCD datasets, demonstrating significant improvements in both the earliness and correctness compared to state-of-the-art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8989-9002"},"PeriodicalIF":11.1000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Multi-Modal Architecture With Spatio-Temporal-Text Adaptation for Video-Based Traffic Accident Anticipation\",\"authors\":\"Patrik Patera;Yie-Tarng Chen;Wen-Hsien Fang\",\"doi\":\"10.1109/TCSVT.2025.3552895\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Early and precise accident anticipation is critical for preventing road traffic incidents in advanced traffic systems. This paper presents a Multi-modal Architecture with Spatio-Temporal-Text Adaptation (MASTTA), featuring a Visual Encoder and a Text Encoder within a streamlined end-to-end framework for traffic accident anticipation. Both encoders leverage the CLIP model, pre-trained on large-scale text-image pairs, to utilize visual and textual information effectively. MASTTA captures complex traffic patterns and relationships by fine-tuning only the adapters, reducing retraining demands. In the Visual Encoder, spatio-temporal adaptation is achieved through a novel Temporal Adapter, a novel Spatial Adapter, and an MLP Adapter. The Temporal Adapter enhances temporal consistency in accident-prone areas, while the Spatial Adapter captures spatio-temporal interactions among visual cues. The Text Encoder, equipped with a Text Adapter and an MLP Adapter, aligns latent textual and visual features in a joint embedding space, refining semantic representation. This synergy of text and visual adapters enables MASTTA to model complex spatial interactions across long-range temporal context, improving accident anticipation. We validate MASTTA on DAD and CCD datasets, demonstrating significant improvements in both the earliness and correctness compared to state-of-the-art methods.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 9\",\"pages\":\"8989-9002\"},\"PeriodicalIF\":11.1000,\"publicationDate\":\"2025-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10933925/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10933925/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
A Multi-Modal Architecture With Spatio-Temporal-Text Adaptation for Video-Based Traffic Accident Anticipation
Early and precise accident anticipation is critical for preventing road traffic incidents in advanced traffic systems. This paper presents a Multi-modal Architecture with Spatio-Temporal-Text Adaptation (MASTTA), featuring a Visual Encoder and a Text Encoder within a streamlined end-to-end framework for traffic accident anticipation. Both encoders leverage the CLIP model, pre-trained on large-scale text-image pairs, to utilize visual and textual information effectively. MASTTA captures complex traffic patterns and relationships by fine-tuning only the adapters, reducing retraining demands. In the Visual Encoder, spatio-temporal adaptation is achieved through a novel Temporal Adapter, a novel Spatial Adapter, and an MLP Adapter. The Temporal Adapter enhances temporal consistency in accident-prone areas, while the Spatial Adapter captures spatio-temporal interactions among visual cues. The Text Encoder, equipped with a Text Adapter and an MLP Adapter, aligns latent textual and visual features in a joint embedding space, refining semantic representation. This synergy of text and visual adapters enables MASTTA to model complex spatial interactions across long-range temporal context, improving accident anticipation. We validate MASTTA on DAD and CCD datasets, demonstrating significant improvements in both the earliness and correctness compared to state-of-the-art methods.
期刊介绍:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.