Statistical analysis of DMV crash data

2016 IEEE Systems and Information Engineering Design Symposium (SIEDS). Pub Date: 2016-04-29. DOI: 10.1109/SIEDS.2016.7489281

Wenting Tong, P. Cherian, Jianzhe Liu, Haoyu Li, Quanquan Gu


Citations: 3

Abstract

This paper presents the statistical methods and models we used to identify factors that lead to fatal car crashes and high damage cost. The results are intended to help the Virginia DMV make corresponding adjustments and reduce the number of crashes that are fatal and that have high damage cost. We used data from 2010 to 2014 for both the fatality analysis and the damage cost analysis; data from 2015 was used for the fatality analysis only. In the first part of the paper, we describe how we identified the factors behind fatal crashes. Because the data are imbalanced, we first subsampled the non-fatal crashes and applied a higher weight to fatal crashes. We then used a logistic regression model to predict whether a crash is fatal. To select the more important features, we kept only numeric factors with a correlation value greater than 0.1. The logistic regression achieved a recall of 40%. We also applied decision trees to the fatality analysis, building two models, one for the 2010-2014 data and one for the 2015 data. In the second part of the paper, we describe how we identified the factors that drive damage cost. Because the values of the damage cost variable are imbalanced, we propose a two-stage method to find its critical factors. First, we used K-nearest neighbors (KNN) to predict whether the damage cost is zero. Second, we fit a Lasso regression on the records where the damage cost was nonzero to discover the factors that lead to damage cost.
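The preprocessing and evaluation steps named in the abstract for the fatality analysis (subsampling non-fatal crashes, keeping numeric features whose correlation exceeds 0.1, and measuring recall) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the `ratio` and `seed` parameters, the use of Pearson correlation, and the use of the absolute value of the correlation are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (assumed convention).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def subsample_majority(rows, labels, ratio=1.0, seed=0):
    """Keep every fatal row (label 1) and a random subset of non-fatal rows (label 0).

    `ratio` is the number of non-fatal rows kept per fatal row; it is a
    hypothetical knob, not a value taken from the paper.
    """
    rng = random.Random(seed)
    fatal = [i for i, y in enumerate(labels) if y == 1]
    non_fatal = [i for i, y in enumerate(labels) if y == 0]
    keep_n = min(len(non_fatal), int(round(ratio * len(fatal))))
    kept = sorted(fatal + rng.sample(non_fatal, keep_n))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def select_by_correlation(rows, labels, names, threshold=0.1):
    """Return names of numeric features whose |correlation| with the label exceeds `threshold`."""
    selected = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        if abs(pearson(column, labels)) > threshold:
            selected.append(name)
    return selected


def recall(y_true, y_pred):
    """Fraction of truly fatal crashes (label 1) that are predicted as fatal."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0


if __name__ == "__main__":
    # Toy data, invented for this example: columns are (speed, constant flag).
    names = ["speed", "constant"]
    rows = [[90, 1], [80, 1], [30, 1], [40, 1], [35, 1], [45, 1], [50, 1], [25, 1]]
    labels = [1, 1, 0, 0, 0, 0, 0, 0]

    sub_rows, sub_labels = subsample_majority(rows, labels, ratio=1.0)
    print(len(sub_rows), sum(sub_labels))
    print(select_by_correlation(rows, labels, names))
    print(recall([1, 1, 1, 1, 1], [1, 1, 0, 0, 0]))
```

By the definition used in `recall`, a value of 0.4 means that two of every five truly fatal crashes are flagged as fatal. A classifier with class weighting would sit between the feature selection and the recall computation; it is omitted here to keep the sketch dependency-free.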


