Feature Engineering and Decision Trees for Predicting High Crash-Risk Locations Using Roadway Indicators

Dimitrios Sarigiannis, Maria Atzemi, Jimi B. Oke, Eleni Christofa, S. Gerasimidis
Transportation Research Record: Journal of the Transportation Research Board. Published 2024-01-07. DOI: https://doi.org/10.1177/03611981231217497

Abstract

Road crashes are a prevalent public health issue across the globe. The objective of this research was to develop a methodology for accurately classifying high crash-risk locations. The hypothesis was that readily obtained roadway indicators can be combined with machine learning techniques to categorize locations as high crash-risk. A database of 5,383 locations, compiled from 2012 to 2015 as part of the Hellenic National Road Safety Project, was used to develop three binary machine learning models that classify high crash-risk locations from roadway indicators: random forest, gradient boosting, and extra trees. The research used feature engineering to reduce the number of indicators in the model, and the synthetic minority oversampling technique to address the imbalance in the dataset between the minority class (high crash-risk locations identified using crash reports) and the majority class (medium to low crash-risk locations identified from local police testimonies, site inspections, and geometry analysis). Although all three models performed similarly, the extra trees model outperformed the other two on a range of performance metrics, including the area under the precision–recall curve and the F1-score. The findings revealed that design speeds, pavement markings, signage presence, and pavement condition were the most influential factors affecting roadway safety. The contribution of this research is a transferable methodology for classifying high crash-risk locations, together with the identification of key indicators of crash-risk potential, which in turn can inform cost-effective data collection and maintenance activities.
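The abstract does not say how the synthetic minority oversampling technique was implemented. The block below is only a minimal sketch of the basic idea, using the Python standard library: each synthetic minority sample is placed on the line segment between an existing minority sample and one of its nearest minority neighbours. The function name `smote`, the two toy features, and all numeric values are hypothetical and are not taken from the study.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Each new point lies on the segment between a randomly chosen minority
    sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: (design speed in km/h, pavement condition score).
high_risk = [(90.0, 2.0), (100.0, 1.5), (95.0, 2.5), (110.0, 1.0)]
n_majority = 12  # assumed size of the majority class in this toy example

new_points = smote(high_risk, n_synthetic=n_majority - len(high_risk), k=3)
balanced_minority = high_risk + new_points
print(len(balanced_minority))
```

In a real pipeline, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used to compute metrics such as the F1-score.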