{"title":"基于视频转换器的多视角多模式驾驶员分心检测对比学习","authors":"Hong Vin Koay, Joon Huang Chuah, C. Chow","doi":"10.1109/TENSYMP55890.2023.10223643","DOIUrl":null,"url":null,"abstract":"Distracted drivers are more likely to get involved in a fatal accident. Thus, detecting actions that may led to distraction should be prioritized to reduce road accidents. However, there are many actions that cause a driver to pivot his attention away from the road. Previous works on detecting distracted drivers are done through a defined set of actions that are considered as distraction. This type of dataset is known as ‘closed set’ since there are still many distraction actions that were not considered by the model. Being different from previous datasets and approaches, in this work, we utilize constructive learning to detect distractions through multiview and multimodal video. The dataset used is the Driver Anomaly Detection dataset. The model is tasked to identify normal and anomalous driving condition in an ‘open set’ manner, where there are unseen anomalous driving condition in the test set. We use Video Transformer as the backbone of the model and validate that the performance is better than convolutional-based backbone. Two views (front and top) of driving clips on two modalities (IR and depth) are used to train individual model. The results of different views and modalities are subsequently fused together. Our method achieves 0.9892 AUC and 97.02% accuracy with Swin-Tiny when considering both views and modalities.","PeriodicalId":314726,"journal":{"name":"2023 IEEE Region 10 Symposium (TENSYMP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Contrastive Learning with Video Transformer for Driver Distraction Detection through Multiview and Multimodal Video\",\"authors\":\"Hong Vin Koay, Joon Huang Chuah, C. Chow\",\"doi\":\"10.1109/TENSYMP55890.2023.10223643\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Distracted drivers are more likely to get involved in a fatal accident. Thus, detecting actions that may led to distraction should be prioritized to reduce road accidents. However, there are many actions that cause a driver to pivot his attention away from the road. Previous works on detecting distracted drivers are done through a defined set of actions that are considered as distraction. This type of dataset is known as ‘closed set’ since there are still many distraction actions that were not considered by the model. Being different from previous datasets and approaches, in this work, we utilize constructive learning to detect distractions through multiview and multimodal video. The dataset used is the Driver Anomaly Detection dataset. The model is tasked to identify normal and anomalous driving condition in an ‘open set’ manner, where there are unseen anomalous driving condition in the test set. We use Video Transformer as the backbone of the model and validate that the performance is better than convolutional-based backbone. Two views (front and top) of driving clips on two modalities (IR and depth) are used to train individual model. The results of different views and modalities are subsequently fused together. 
Our method achieves 0.9892 AUC and 97.02% accuracy with Swin-Tiny when considering both views and modalities.\",\"PeriodicalId\":314726,\"journal\":{\"name\":\"2023 IEEE Region 10 Symposium (TENSYMP)\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE Region 10 Symposium (TENSYMP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TENSYMP55890.2023.10223643\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE Region 10 Symposium (TENSYMP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TENSYMP55890.2023.10223643","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Contrastive Learning with Video Transformer for Driver Distraction Detection through Multiview and Multimodal Video
Distracted drivers are more likely to be involved in a fatal accident, so detecting actions that may lead to distraction should be prioritized to reduce road accidents. However, many different actions can divert a driver's attention away from the road. Previous work on distracted-driver detection relies on a predefined set of actions labeled as distractions; such datasets are known as 'closed set', since many distracting actions remain outside the model's label space. Departing from previous datasets and approaches, in this work we use contrastive learning to detect distraction from multiview and multimodal video. We use the Driver Anomaly Detection dataset, and the model is tasked with separating normal from anomalous driving conditions in an 'open set' manner: the test set contains anomalous driving conditions unseen during training. We adopt a Video Transformer as the model backbone and validate that it outperforms convolutional backbones. Two views (front and top) of driving clips in two modalities (IR and depth) are used to train individual models, and the scores from the different views and modalities are subsequently fused. Our method achieves 0.9892 AUC and 97.02% accuracy with Swin-Tiny when considering both views and modalities.
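To make the contrastive objective and the late fusion concrete, below is a minimal PyTorch sketch. It assumes a video backbone that returns pooled clip features (e.g., a Video Swin-Tiny from an off-the-shelf library); the projection head, the temperature value, the "normality center" formulation of the loss, and the score-averaging fusion are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

class ContrastiveHead(torch.nn.Module):
    # Projects pooled backbone features onto the unit hypersphere,
    # where cosine similarity is a meaningful distance.
    def __init__(self, in_dim=768, out_dim=128):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def contrastive_loss(normal_emb, anomalous_emb, temperature=0.1):
    # Pull normal-driving embeddings toward their batch mean (a
    # "normality center") and push anomalous embeddings away from it.
    center = F.normalize(normal_emb.mean(dim=0, keepdim=True), dim=-1)
    pos = torch.exp(normal_emb @ center.t() / temperature)     # (N, 1)
    neg = torch.exp(anomalous_emb @ center.t() / temperature)  # (M, 1)
    return -torch.log(pos.sum() / (pos.sum() + neg.sum()))

def fused_score(clips, models, centers):
    # Late fusion across the four view/modality models: each model
    # scores its own clip by cosine similarity to its normality
    # center, and the per-model scores are averaged. Sweeping a
    # threshold over the fused score yields accuracy and AUC.
    scores = []
    for (backbone, head), center, clip in zip(models, centers, clips):
        emb = head(backbone(clip))          # (1, out_dim), unit norm
        scores.append((emb @ center.t()).squeeze())
    return torch.stack(scores).mean()

Under this formulation, a clip is flagged as anomalous when its fused similarity to normal driving falls below a threshold. An unseen distraction type needs no dedicated label, only a low similarity to normal driving, which is what makes the open-set evaluation possible.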