SIEVE: Generating a cybersecurity log dataset collection for SIEM event classification

IF 4.4 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Computer Networks | Pub Date: 2025-05-08 | DOI: 10.1016/j.comnet.2025.111330

Pierpaolo Artioli, Vincenzo Dentamaro, Stefano Galantucci, Alessio Magrì, Gianluca Pellegrini, Gianfranco Semeraro

Computer Networks, Volume 266, Article 111330. Full text: https://www.sciencedirect.com/science/article/pii/S138912862500297X

Citations: 0


SIEVE: Generating a cybersecurity log dataset collection for SIEM event classification

Effective cyber threat monitoring relies on deploying robust Security Information and Event Management (SIEM) systems. SIEM applications receive security events generated by different devices, systems, and applications, and should correlate them properly to identify potential cyber threats, based on tactics, techniques, and procedures (TTP), that bypass other security mechanisms (e.g., firewalls, IDS). Because logs are primarily generated to report relevant system events and activities in a human-readable format, supervised Natural Language Processing (NLP) techniques could be used to train models that complement conventional parsing methodologies by automatically suggesting a classification of events into predefined categories. Training such models requires a substantial amount of pre-classified (labeled) data of different types to provide the patterns and nuances needed to make accurate predictions. Security event datasets are scarce for privacy or availability reasons, and the few publicly available ones are often limited in event diversity or number of labels, or are simply unfit for the task at hand, so an effective synthetic dataset for training SIEM-related machine learning event classification algorithms could be very useful. For these reasons, this paper proposes the generation of a synthetic dataset specifically designed to train SIEM systems for log-type classification.

Starting from an in-depth methodological analysis of the prominent cybersecurity-related datasets available in the literature, the paper introduces SIEVE (Siem Ingesting EVEnts), a synthetic dataset collection built from publicly available log samples using SPICE (Semantic Perturbation and Instantiation for Content Enrichment), a novel text augmentation and perturbation technique. SPICE is shown to be effective in generating realistic logs. The instances of the dataset collection display different levels of augmentation. Performance was then assessed by benchmarking various NLP classification models: the classifiers were trained on SIEVE and tested on both SIEVE logs and real logs. The results show that the best model among those tested is SVM (MaF1 0.9323 - 0.9737), which maintains its performance with only slight degradation even in tests on real logs (MaF1 0.9477 - 0.9636). BERT, on the other hand, performs better than SVM in most of the tests on SIEVE (MaF1 0.9528 - 0.9730) but does not show robustness when tested on real logs (MaF1 0.8864 - 0.9182).
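The abstract reports results as MaF1 without defining it. Assuming MaF1 denotes the macro-averaged F1 score (an assumption, not confirmed by the text above), the metric can be sketched as the unweighted mean of per-class F1 values, so that rare log types count as much as frequent ones. The function name `macro_f1` and the toy labels below are hypothetical and are not taken from SIEVE or from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy labels for three log types (illustrative only).
y_true = ["auth", "auth", "firewall", "firewall", "dns", "dns"]
y_pred = ["auth", "auth", "firewall", "dns", "dns", "dns"]
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")  # prints "macro F1 = 0.8222"
```

In this toy case the per-class F1 values are 1.0 (auth), 2/3 (firewall) and 0.8 (dns), which average to about 0.8222. Running the same function once on held-out synthetic logs and once on real logs would give the kind of paired comparison the abstract describes.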


Source journal

Computer Networks (Engineering & Technology: Telecommunications)

CiteScore

10.80

Self-citation rate

3.60%

Articles published

434

Review time

8.6 months

Journal description: Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.