Improving internet of vehicles research: A systematic preprocessing framework for the VeReMi dataset

Impact Factor: 1.0 | JCR: Q3 | Multidisciplinary Sciences

Data in Brief | Pub Date: 2025-04-28 | DOI: 10.1016/j.dib.2025.111599

Aparup Roy, Debotosh Bhattacharjee, Ondrej Krejcar

Data in Brief, volume 60, Article 111599. Journal Article. https://www.sciencedirect.com/science/article/pii/S2352340925003312

Citations: 0

Abstract

The Vehicular Reference Misbehavior Dataset (VeReMi) is a vital resource for advancing Intelligent Transportation Systems (ITS) and the Internet of Vehicles (IoV). However, its large size (∼7 GB) and inherent class imbalance make it difficult to use for machine learning model development. This paper presents a preprocessing framework to enhance VeReMi’s usability and relevance. Down-sampling to 10% reduced the dataset to ∼724 MB, making it computationally manageable. Bias was addressed by identifying benign instances using predefined criteria and by balancing benign and malicious samples through synthesis. A refined feature set, including key attributes such as rcvTime, pos_0, pos_1, and attack_type (renamed attacker_type), was selected to improve machine learning compatibility. The preprocessing pipeline maintains data integrity and preserves the representativeness of malicious patterns. The optimized dataset is well suited to ITS and IoV applications such as anomaly detection and network security, underscoring the role of preprocessing in overcoming real-world constraints and improving model performance.
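The steps named in the abstract (down-sampling, feature selection with the attack_type rename, and class balancing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the benign criteria or the synthesis method, so the sketch assumes a label of 0 marks a benign message and substitutes plain random oversampling (duplicating rows of the smaller class) for synthesis. The function name `preprocess`, the toy records, and the `unused` field are hypothetical.

```python
import random

# Only the columns named in the abstract; the full refined feature set is not listed there.
FEATURES = ["rcvTime", "pos_0", "pos_1", "attack_type"]
RENAMES = {"attack_type": "attacker_type"}
BENIGN_LABEL = 0  # assumption: the abstract does not state the benign criteria


def preprocess(rows, frac=0.10, seed=0):
    """Down-sample, select and rename features, then balance the two classes."""
    rng = random.Random(seed)

    # Step 1: keep a random fraction of the records (10% by default).
    k = max(1, round(len(rows) * frac))
    sampled = rng.sample(rows, k)

    # Step 2: keep only the selected features and apply the rename.
    slim = [{RENAMES.get(name, name): row[name] for name in FEATURES}
            for row in sampled]

    # Step 3: split into benign and malicious records using the assumed label.
    benign = [r for r in slim if r["attacker_type"] == BENIGN_LABEL]
    malicious = [r for r in slim if r["attacker_type"] != BENIGN_LABEL]
    if not benign or not malicious:
        raise ValueError("sample contains a single class; increase frac")

    # Step 4: balance by random oversampling of the smaller class.
    # This duplicates existing rows; it stands in for the paper's synthesis step.
    small, large = sorted((benign, malicious), key=len)
    extra = rng.choices(small, k=len(large) - len(small))
    return large + small + extra


# Toy records standing in for VeReMi messages (values are made up).
demo_rng = random.Random(1)
raw = [
    {
        "rcvTime": float(i),
        "pos_0": demo_rng.uniform(0, 5000),
        "pos_1": demo_rng.uniform(0, 5000),
        "attack_type": 1 if i % 3 == 0 else 0,
        "unused": "x",
    }
    for i in range(1000)
]
balanced = preprocess(raw)
```

Duplicating minority rows is the simplest way to equalize class counts; a synthesis-based method would generate new samples instead, and the order of down-sampling and balancing is likewise a choice made here for illustration.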




Source journal

Data in Brief (Multidisciplinary Sciences)

CiteScore: 3.10

Self-citation rate: 0.00%

Articles published: 996

Review time: 70 days

Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.