{"title":"YOLO-HyperVision: A vision transformer backbone-based enhancement of YOLOv5 for detection of dynamic traffic information","authors":"Shizhou Xu, Mengjie Zhang , Jingyu Chen, Yiming Zhong","doi":"10.1016/j.eij.2024.100523","DOIUrl":null,"url":null,"abstract":"<div><p>With the increase of traffic flow in modern urban areas, traffic congestion has become a serious problem that affects people’s normal production and life. Using target detection technology instead of manual labor can quickly detect the road traffic situation and provide timely information about the traffic flow. However, when using drones to observe the traffic flow in the air, the perspective effect will cause the detected vehicles and pedestrians to be very small, and the scale difference between different categories of targets is large, which increases the detection difficulty of a single convolutional neural network model. In order to solve the problem of low accuracy of traditional single-stage target detection models, this study proposes an improved Yolov5 vehicle target detection model with Vision Transformer (VIT) backbone, You Only Look Once-HyperVision (YOLO-HV), which aims to solve the problem of poor multi-scale target recognition performance caused by the inability of traditional CNN networks to integrate contextual information, and help drones achieve more efficient and accurate traffic flow recognition functions. This study deeply integrates the Vision Transformer (VIT) backbone and Convolutional Neural Network (CNN), effectively combining the multi-scale detection advantages of Vision Transformer and the inductive bias ability of Convolutional Neural Network, and adds multi-scale residual modules and context correlation enhancement modules, which greatly improves the recognition accuracy of single-stage detectors for drone images. Through comparative experiments on the VisDrone dataset, it is found that the detection performance of this model is improved compared with several commonly used detection models. YOLO-HV can increase the mean average precision (mAP) by 3.3% compared with the pure convolutional network of the same model size. YOLO-HV model has achieved excellent performance in the task of traffic flow image detection taken by drones, and can more accurately identify and classify road vehicles than various target detection models.</p></div>","PeriodicalId":56010,"journal":{"name":"Egyptian Informatics Journal","volume":null,"pages":null},"PeriodicalIF":5.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1110866524000860/pdfft?md5=50b127ef84d8fdf25c77f2d161914ee0&pid=1-s2.0-S1110866524000860-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Egyptian Informatics Journal","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1110866524000860","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
With the growth of traffic flow in modern urban areas, traffic congestion has become a serious problem that disrupts everyday production and life. Target detection technology can replace manual observation to quickly assess road conditions and provide timely information about traffic flow. However, when drones observe traffic from the air, the perspective effect makes detected vehicles and pedestrians appear very small, and the scale difference between target categories is large, which makes detection difficult for a single convolutional neural network (CNN) model. To address the low accuracy of traditional single-stage target detection models, this study proposes You Only Look Once-HyperVision (YOLO-HV), an improved YOLOv5 vehicle detection model with a Vision Transformer (ViT) backbone. It targets the poor multi-scale recognition performance caused by the inability of traditional CNNs to integrate contextual information, and helps drones recognize traffic flow more efficiently and accurately. The study deeply integrates a ViT backbone with a CNN, combining the multi-scale detection strengths of the Vision Transformer with the inductive bias of the CNN, and adds multi-scale residual modules and context correlation enhancement modules, which greatly improve the recognition accuracy of single-stage detectors on drone imagery. Comparative experiments on the VisDrone dataset show that the model outperforms several commonly used detectors; YOLO-HV increases mean average precision (mAP) by 3.3% over a pure convolutional network of the same model size. YOLO-HV achieves excellent performance on drone-captured traffic-flow images and identifies and classifies road vehicles more accurately than various other target detection models.
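To make the hybrid idea concrete, the sketch below shows one minimal way (in PyTorch) to combine a convolutional stem, which supplies local inductive bias, with a small Vision Transformer encoder that mixes global context into the deepest feature map before multi-scale features are handed to a YOLO-style neck. All module names, channel widths, and depths here are illustrative assumptions for exposition; this is not the paper's actual YOLO-HV configuration, and the multi-scale residual and context correlation enhancement modules described in the abstract are only loosely approximated.

```python
# Minimal illustrative sketch of a CNN + ViT hybrid backbone (assumed design,
# not the published YOLO-HV architecture). Requires only PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvStage(nn.Module):
    """Strided conv downsampling followed by a residual 3x3 block (assumed)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())
        self.res = nn.Sequential(
            nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out))

    def forward(self, x):
        x = self.down(x)
        return F.silu(x + self.res(x))  # residual connection keeps local detail


class ViTBlockOnFeatureMap(nn.Module):
    """Flattens an HxW feature map into tokens, applies transformer encoder
    layers for global context, then reshapes back to a feature map."""
    def __init__(self, channels, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=channels * 2,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.encoder(tokens)           # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class HybridBackbone(nn.Module):
    """CNN stages for local features plus a ViT block for context; returns
    multi-scale feature maps as a YOLO-style detection neck would expect."""
    def __init__(self):
        super().__init__()
        self.stage1 = ConvStage(3, 64)     # stride 2
        self.stage2 = ConvStage(64, 128)   # stride 4
        self.stage3 = ConvStage(128, 256)  # stride 8
        self.stage4 = ConvStage(256, 256)  # stride 16
        self.vit = ViTBlockOnFeatureMap(256)

    def forward(self, x):
        p2 = self.stage2(self.stage1(x))
        p3 = self.stage3(p2)
        p4 = self.vit(self.stage4(p3))
        return p2, p3, p4  # small / medium / large-object scales


if __name__ == "__main__":
    feats = HybridBackbone()(torch.randn(1, 3, 256, 256))
    print([tuple(f.shape) for f in feats])
    # (1, 128, 64, 64), (1, 256, 32, 32), (1, 256, 16, 16)
```

Applying the attention block only at the coarsest stride keeps the token count small (16 x 16 = 256 tokens for a 256 x 256 input), which is one common way to make global self-attention affordable on high-resolution drone imagery.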
Journal Introduction:
The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The Journal provides a forum for state-of-the-art research and development in the fields of computing, including computer sciences, information technologies, information systems, operations research, and decision support. Submissions of innovative, previously unpublished work in subjects covered by the Journal are encouraged, whether from academic, research, or commercial sources.