Autonomous vehicle crash risk modeling by integrating data augmentation and two-layer stacking
Authors: Leipeng Zhu, Zhiqing Zhang, Yongnan Zhang, Jingyang Yu, Hongjia Wang
Journal: Computers in Industry, volume 171, Article 104320 (JCR Q1, Computer Science, Interdisciplinary Applications; impact factor 9.1)
Publication date: 2025-06-03
DOI: 10.1016/j.compind.2025.104320
URL: https://www.sciencedirect.com/science/article/pii/S0166361525000855
Citations: 0
Abstract
Autonomous vehicle (AV) technology aims to eliminate traffic crashes caused by driver error, but its adoption has introduced new types of crashes. Because AV crash data are high-dimensional and sample sizes are small, identifying underlying risk factors remains challenging, and crash prediction performance is often suboptimal. To address these issues, this study develops an interpretable data augmentation strategy and an optimized two-layer stacking algorithm, and integrates them into a unified framework that accurately identifies key crash contributing factors and significantly improves predictive performance. The findings reveal that: 1) AV crashes vary significantly in their temporal distributions but follow consistent spatial agglomeration patterns. 2) AV reliability decreases significantly in high-interaction scenarios, with peak travel times and uncertain road conditions identified as key contributing factors. 3) By augmenting the key contributing factors and their feature crosses, the data augmentation algorithm enhances the model's ability to capture nonlinear relationships in crash data and improves predictive accuracy in small-sample scenarios, particularly for injury-related crashes. 4) The optimized two-layer stacking algorithm integrates the heterogeneous learning capabilities of models such as LightGBM and Random Forest, significantly improving the ability to recognize complex crash patterns. When combined with data augmentation, the framework achieves strong predictive performance, with precision and recall both reaching 0.92 and an area under the receiver operating characteristic curve (AUC) of 0.96. Compared with existing machine learning approaches, the framework shows notable advantages in handling high-dimensional, small-sample AV crash data. It provides an effective solution for AV crash risk modeling and safety design, contributing to the development and implementation of safer intelligent transportation systems.
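The abstract reports three evaluation metrics: precision, recall, and the area under the ROC curve. As a reminder of how these quantities are defined, here is a minimal standard-library sketch. The labels and scores are invented toy values, not the paper's data, and the function names `precision_recall` and `roc_auc` are hypothetical helpers written for this illustration; the authors' actual evaluation code is not described in this record.

```python
from typing import List, Tuple


def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 1 = injury crash, 0 = no injury; scores are made up.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(precision, recall, auc)  # prints 0.75 0.75 0.9375
```

Precision and recall depend on the chosen decision threshold (0.5 here), whereas AUC is threshold-free, which is one reason the two kinds of metric are often reported together for imbalanced, small-sample problems.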
About the journal:
The objective of Computers in Industry is to present original, high-quality, application-oriented research papers that:
• Illuminate emerging trends and possibilities in the utilization of Information and Communication Technology in industry;
• Establish connections or integrations across various technology domains within the expansive realm of computer applications for industry;
• Foster connections or integrations across diverse application areas of ICT in industry.