Sentence-resampled BERT-CRF model for autonomous vehicle crash causality analysis from large-scale accident narrative text data
Authors: Ruixu Pan, Quan Yuan, Jiaming Cao, Chonghao Zhang, Chengcheng Yu, Qian Liu, Chao Yang, Xingyu Liang
Journal: Accident Analysis & Prevention, volume 221, article 108184 (Journal Article; impact factor 6.2; JCR Q1, Ergonomics)
Published: 2025-10-01 (Epub 2025-08-07)
DOI: https://doi.org/10.1016/j.aap.2025.108184
Citations: 0
Abstract
As autonomous vehicles (AVs) become more widely used, understanding crash causality mechanisms is critical to improving the safety of AV use. However, existing studies have primarily used structured data to analyze such causality, and few have tried to identify causality from unstructured crash narratives, which are characterized by data imbalance and small sample sizes. Original crash narratives contain a wealth of latent information about AV crashes that can further the understanding of AV safety. This study proposes a Sentence-resampled BERT-CRF model combined with a DREAM-inspired hierarchical causal attribution framework to systematically analyze the causality mechanisms of AV crashes based on original crash narratives. First, an annotation scheme combining "BIO" and "C-P-R-D" tags is designed to capture temporal causal relationships in crash narratives, and the BERT-CRF model is used to extract causal movement chains (CMCs). The data imbalance problem is mitigated with a sentence-level resampling method; the model reaches 98.03% accuracy on the complete dataset and maintains 96.14% accuracy on a small 10% sample. Then, a two-tier causal attribution framework (5 categories and 52 elements) inspired by DREAM theory is developed to identify 16 categories of typical scenarios, with rear-end (48.57%) and lane-change (17.04%) collisions as the high-risk scenarios. In-depth analysis shows that rear-end crashes are mostly caused by the coupling of a conventional vehicle (CV) following the AV too closely (B5) and the AV's insufficiently decisive decision to slow down (A2), while lane-change crashes are associated with the CV's hazardous lane change (B2) and delays in the AV's intent recognition. The proposed framework bridges the gap between unstructured narrative data and structured causal inference, revealing human-computer interaction deficiencies, environment perception limitations, and roadway facility impacts as the core causal factors. These findings provide data-driven theoretical support for AV manufacturers to optimize sensing algorithms and for traffic authorities to develop corresponding regulations.
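The abstract does not spell out the resampling rule, so the following is only an illustrative sketch of what sentence-level resampling over BIO-tagged data could look like, not the authors' method. It assumes tags of the form "B-X", "I-X", and "O", and it duplicates whole sentences (never single tokens) so that each tagged span stays intact. The function name `resample_sentences`, the oversampling rule, and the toy tokens and tags are all hypothetical.

```python
import random
from collections import Counter


def resample_sentences(sentences, seed=0):
    """Oversample sentences that contain under-represented entity types.

    `sentences` is a list of (tokens, tags) pairs with BIO-style tags such as
    "B-C", "I-C", or "O". Hypothetical rule, assumed for illustration only.
    """
    rng = random.Random(seed)

    def types(tags):
        # Entity types present in one sentence, e.g. {"C", "D"}.
        return {t.split("-", 1)[1] for t in tags if t != "O"}

    # Count how many sentences mention each entity type.
    counts = Counter(ty for _, tags in sentences for ty in types(tags))
    if not counts:
        return list(sentences)
    target = max(counts.values())

    resampled = list(sentences)
    for ty, n in counts.items():
        pool = [s for s in sentences if ty in types(s[1])]
        # Duplicate whole sentences until this type appears in at
        # least `target` sentences.
        resampled.extend(rng.choice(pool) for _ in range(target - n))
    return resampled


# Toy data: three sentences carry type "C", only one carries type "D".
data = [
    (["AV", "stopped"], ["B-C", "I-C"]),
    (["CV", "approached"], ["B-C", "I-C"]),
    (["CV", "braked", "late"], ["B-C", "I-C", "O"]),
    (["rear", "bumper", "damaged"], ["B-D", "I-D", "I-D"]),
]
balanced = resample_sentences(data)
```

Resampling at the sentence level rather than the token level keeps every B-/I- sequence contiguous, which matters for a CRF layer that scores transitions between adjacent tags.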
About the journal:
Accident Analysis & Prevention provides wide coverage of the general areas relating to accidental injury and damage, including the pre-injury and immediate post-injury phases. Published papers deal with medical, legal, economic, educational, behavioral, theoretical or empirical aspects of transportation accidents, as well as with accidents at other sites. Selected topics within the scope of the Journal may include: studies of human, environmental and vehicular factors influencing the occurrence, type and severity of accidents and injury; the design, implementation and evaluation of countermeasures; biomechanics of impact and human tolerance limits to injury; modelling and statistical analysis of accident data; policy, planning and decision-making in safety.