{"title":"筛:生成用于SIEM事件分类的网络安全日志数据集集合","authors":"Pierpaolo Artioli , Vincenzo Dentamaro , Stefano Galantucci , Alessio Magrì , Gianluca Pellegrini , Gianfranco Semeraro","doi":"10.1016/j.comnet.2025.111330","DOIUrl":null,"url":null,"abstract":"<div><div>Effective cyber threat monitoring relies on deploying robust Security Information and Event Management (SIEM) systems. SIEM applications receive security events generated by different devices, systems, and applications. They should properly correlate them to identify potential cyber threats based on tactics, techniques, and procedures (TTP), bypassing other security mechanisms (e.g., firewall, IDS, etc.). Given that logs are primarily generated to notify relevant system events and activities in a human-readable format, supervised Natural Language Processing (NLP) techniques could be used to train models that complement conventional parsing methodologies by automatically suggesting event classification into pre-defined categories. Training such models requires a substantial amount of pre-classified (labeled) data of different types to provide the learning patterns and nuances needed to make accurate predictions. Since the number of security event datasets is scarce due to privacy or availability reasons, and the few publicly available ones are often limited in terms of event diversity, number of labels, or simply unfit for the task at hand, an effective synthetic dataset for training SIEM-related machine learning event classification algorithms could be very useful. For these reasons, this paper proposes the generation of a synthetic dataset specifically designed to train SIEM systems for log-type classification. This research paper, starting from an in-depth methodological analysis of the prominent Cybersecurity related datasets available in the literature, introduces SIEVE (Siem Ingesting EVEnts), a synthetic dataset collection built from publicly available log samples using SPICE (Semantic Perturbation and Instantiation for Content Enrichment), a novel text augmentation and perturbation technique. SPICE is shown to be effective in generating realistic logs. Each instance of the dataset collection displays different levels of augmentation. Subsequent performance assessments were conducted through comprehensive benchmarking against various NLP classification models. Tests were conducted by training the classifiers using SIEVE and testing them on both the same SIEVE logs and real logs. The results of the experiments show that the best model among those tested is SVM (MaF1 0.9323 - 0.9737), which maintains its performance with slight degradation, even in tests on real logs (MaF1 0.9477 - 0.9636). BERT, on the other hand, performs better than SVM in most of the tests on SIEVE (MaF1 0.9528 - 0.9730) but does not show robustness when tested on real logs (MaF1 0.8864 - 0.9182).</div></div>","PeriodicalId":50637,"journal":{"name":"Computer Networks","volume":"266 ","pages":"Article 111330"},"PeriodicalIF":4.4000,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SIEVE: Generating a cybersecurity log dataset collection for SIEM event classification\",\"authors\":\"Pierpaolo Artioli , Vincenzo Dentamaro , Stefano Galantucci , Alessio Magrì , Gianluca Pellegrini , Gianfranco Semeraro\",\"doi\":\"10.1016/j.comnet.2025.111330\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Effective cyber threat monitoring relies on deploying robust Security Information and Event Management (SIEM) systems. SIEM applications receive security events generated by different devices, systems, and applications. They should properly correlate them to identify potential cyber threats based on tactics, techniques, and procedures (TTP), bypassing other security mechanisms (e.g., firewall, IDS, etc.). Given that logs are primarily generated to notify relevant system events and activities in a human-readable format, supervised Natural Language Processing (NLP) techniques could be used to train models that complement conventional parsing methodologies by automatically suggesting event classification into pre-defined categories. Training such models requires a substantial amount of pre-classified (labeled) data of different types to provide the learning patterns and nuances needed to make accurate predictions. Since the number of security event datasets is scarce due to privacy or availability reasons, and the few publicly available ones are often limited in terms of event diversity, number of labels, or simply unfit for the task at hand, an effective synthetic dataset for training SIEM-related machine learning event classification algorithms could be very useful. For these reasons, this paper proposes the generation of a synthetic dataset specifically designed to train SIEM systems for log-type classification. This research paper, starting from an in-depth methodological analysis of the prominent Cybersecurity related datasets available in the literature, introduces SIEVE (Siem Ingesting EVEnts), a synthetic dataset collection built from publicly available log samples using SPICE (Semantic Perturbation and Instantiation for Content Enrichment), a novel text augmentation and perturbation technique. SPICE is shown to be effective in generating realistic logs. Each instance of the dataset collection displays different levels of augmentation. Subsequent performance assessments were conducted through comprehensive benchmarking against various NLP classification models. Tests were conducted by training the classifiers using SIEVE and testing them on both the same SIEVE logs and real logs. The results of the experiments show that the best model among those tested is SVM (MaF1 0.9323 - 0.9737), which maintains its performance with slight degradation, even in tests on real logs (MaF1 0.9477 - 0.9636). BERT, on the other hand, performs better than SVM in most of the tests on SIEVE (MaF1 0.9528 - 0.9730) but does not show robustness when tested on real logs (MaF1 0.8864 - 0.9182).</div></div>\",\"PeriodicalId\":50637,\"journal\":{\"name\":\"Computer Networks\",\"volume\":\"266 \",\"pages\":\"Article 111330\"},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2025-05-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S138912862500297X\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S138912862500297X","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
SIEVE: Generating a cybersecurity log dataset collection for SIEM event classification
Effective cyber threat monitoring relies on deploying robust Security Information and Event Management (SIEM) systems. SIEM applications receive security events generated by different devices, systems, and applications. They should properly correlate them to identify potential cyber threats based on tactics, techniques, and procedures (TTP), bypassing other security mechanisms (e.g., firewall, IDS, etc.). Given that logs are primarily generated to notify relevant system events and activities in a human-readable format, supervised Natural Language Processing (NLP) techniques could be used to train models that complement conventional parsing methodologies by automatically suggesting event classification into pre-defined categories. Training such models requires a substantial amount of pre-classified (labeled) data of different types to provide the learning patterns and nuances needed to make accurate predictions. Since the number of security event datasets is scarce due to privacy or availability reasons, and the few publicly available ones are often limited in terms of event diversity, number of labels, or simply unfit for the task at hand, an effective synthetic dataset for training SIEM-related machine learning event classification algorithms could be very useful. For these reasons, this paper proposes the generation of a synthetic dataset specifically designed to train SIEM systems for log-type classification. This research paper, starting from an in-depth methodological analysis of the prominent Cybersecurity related datasets available in the literature, introduces SIEVE (Siem Ingesting EVEnts), a synthetic dataset collection built from publicly available log samples using SPICE (Semantic Perturbation and Instantiation for Content Enrichment), a novel text augmentation and perturbation technique. SPICE is shown to be effective in generating realistic logs. Each instance of the dataset collection displays different levels of augmentation. Subsequent performance assessments were conducted through comprehensive benchmarking against various NLP classification models. Tests were conducted by training the classifiers using SIEVE and testing them on both the same SIEVE logs and real logs. The results of the experiments show that the best model among those tested is SVM (MaF1 0.9323 - 0.9737), which maintains its performance with slight degradation, even in tests on real logs (MaF1 0.9477 - 0.9636). BERT, on the other hand, performs better than SVM in most of the tests on SIEVE (MaF1 0.9528 - 0.9730) but does not show robustness when tested on real logs (MaF1 0.8864 - 0.9182).
期刊介绍:
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.