Driver Gaze Zone Estimation Based on Three-Channel Convolution-Optimized Vision Transformer With Transfer Learning

IF 4.3 | Zone 2 (Multidisciplinary Journals) | JCR Q1, Engineering, Electrical & Electronic

IEEE Sensors Journal Pub Date : 2024-10-31 DOI:10.1109/JSEN.2024.3486373

Zhao Li;Siyang Jiang;Rui Fu;Yingshi Guo;Chang Wang

IEEE Sensors Journal, vol. 24, no. 24, pp. 42064-42078. Journal article, not open access. https://ieeexplore.ieee.org/document/10740606/

Citations: 0

Abstract

Driver gaze zone estimation (DGZE) is essential for detecting the driver’s state and for formulating takeover rules in intelligent driving systems. However, convolutional neural network (CNN)-based multichannel models lack global feature extraction capability and have a large number of parameters and high computational complexity. Therefore, this article proposes a novel method that uses a three-channel convolution-optimized vision transformer (3C-CoViT) to estimate the driver’s gaze zone. The method replaces the linear projection in the pure ViT structure with a convolutional projection, converts the input images of the different channels into image sequences, and then adds a convolutional feed-forward network to extract the local features of the tokens, enhance the correlation of adjacent tokens in the spatial dimensions, and improve the performance and efficiency of the model. We pretrained the model on the GazeCapture dataset and then, following a transfer-learning approach, fine-tuned it on a dataset built from an actual on-road experiment. To enhance the interpretability of the model, we present a novel visualization method. Experimental results show that the proposed method accurately identifies driver gaze zones (98.04% average accuracy) and outperforms state-of-the-art methods in terms of accuracy and reliability. Ablation studies confirm the advantage of the proposed method over the pure ViT and the beneficial effects of transfer learning and three-channel input.
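The difference between a linear patch projection and a convolutional projection can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the abstract does not give kernel sizes, strides, embedding dimensions, or the definition of the three input channels, so the function name `conv_tokenize`, the toy 4x4 image, and the two 2x2 filters below are all hypothetical. The sketch relies only on the standard fact that a ViT-style linear projection of non-overlapping patches equals a convolution whose stride equals its kernel size, whereas a smaller stride makes neighbouring tokens share pixels.

```python
from typing import List

Image = List[List[float]]   # single-channel image, H rows x W columns
Kernel = List[List[float]]  # one k x k filter; one filter per embedding feature


def conv_tokenize(image: Image, kernels: List[Kernel],
                  stride: int, padding: int = 0) -> List[List[float]]:
    """Slide every k x k kernel over the image and return one token per
    window position in row-major order. Each token has len(kernels) features."""
    k = len(kernels[0])
    h, w = len(image), len(image[0])
    # Zero-pad the border so that windows near the edge stay valid.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    # Standard convolution output-size formula.
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    tokens = []
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            token = [
                sum(kern[a][b] * padded[top + a][left + b]
                    for a in range(k) for b in range(k))
                for kern in kernels
            ]
            tokens.append(token)
    return tokens


# Hypothetical toy input: a 4x4 image with pixel values 0..15.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
# Two hypothetical 2x2 filters, so every token has 2 features.
kernels = [
    [[1.0, 1.0], [1.0, 1.0]],    # sums the window
    [[1.0, -1.0], [1.0, -1.0]],  # left column minus right column
]

# stride == kernel size, no padding: same tokens as a linear patch projection.
patch_tokens = conv_tokenize(image, kernels, stride=2)
# stride < kernel size: overlapping windows, so adjacent tokens share pixels.
overlap_tokens = conv_tokenize(image, kernels, stride=1)

print(len(patch_tokens), len(overlap_tokens))  # prints "4 9"
```

In this toy setting the overlapping variant produces more tokens (9 instead of 4), and each pair of neighbouring tokens is computed from partly the same pixels, which is one simple way to see how a convolutional projection can tie adjacent tokens together spatially.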


Source journal

IEEE Sensors Journal (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.70

Self-citation rate: 14.00%

Publications: 2058

Review time: 5.2 months

Journal scope: The fields of interest of the IEEE Sensors Journal are the theory, design, fabrication, manufacturing and applications of devices for sensing and transducing physical, chemical and biological phenomena, with emphasis on the electronics and physics aspect of sensors and integrated sensors-actuators. IEEE Sensors Journal deals with the following:

- Sensor Phenomenology, Modelling, and Evaluation
- Sensor Materials, Processing, and Fabrication
- Chemical and Gas Sensors
- Microfluidics and Biosensors
- Optical Sensors
- Physical Sensors: Temperature, Mechanical, Magnetic, and others
- Acoustic and Ultrasonic Sensors
- Sensor Packaging
- Sensor Networks
- Sensor Applications
- Sensor Systems: Signals, Processing, and Interfaces
- Actuators and Sensor Power Systems
- Sensor Signal Processing for high precision and stability (amplification, filtering, linearization, modulation/demodulation) and under harsh conditions (EMC, radiation, humidity, temperature); energy consumption/harvesting
- Sensor Data Processing (soft computing with sensor data, e.g., pattern recognition, machine learning, evolutionary computation; sensor data fusion; processing of wave, e.g., electromagnetic and acoustic, and non-wave, e.g., chemical, gravity, particle, thermal, radiative and non-radiative sensor data; detection, estimation and classification based on sensor data)
- Sensors in Industrial Practice