A systematic literature review of cyber-security data repositories and performance assessment metrics for semi-supervised learning.

Paul K Mvula, Paula Branco, Guy-Vincent Jourdan, Herna L Viktor
Journal: Discover data, 1(1), 4. Publication date: 2023-01-01. Publication type: Journal Article. DOI: 10.1007/s44248-023-00003-x (https://doi.org/10.1007/s44248-023-00003-x). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10079755/pdf/
Citation count: 1

Abstract

In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security or computer security are numerous, including cyber threat mitigation and security infrastructure enhancement through pattern recognition, real-time attack detection, and in-depth penetration testing. Therefore, for these applications in particular, the datasets used to build the models must be carefully chosen so that they are representative of real-world data. However, because of the scarcity of labelled data and the cost of manually labelling positive examples, there is a growing corpus of literature utilizing Semi-Supervised Learning with cyber-security data repositories. In this work, we provide a comprehensive overview of publicly available data repositories and datasets used for building computer security or cyber-security systems based on Semi-Supervised Learning, where only a few labels are necessary or available for building strong models. We highlight the strengths and limitations of the data repositories and datasets and provide an analysis of the performance assessment metrics used to evaluate the built models. Finally, we discuss open challenges and provide future research directions for using cyber-security datasets and evaluating models built upon them.
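
As a rough illustration of why the choice of performance assessment metric matters for cyber-security data, the sketch below computes accuracy, precision, recall and F1 for a binary detector. The abstract does not name specific metrics, so these four are an assumption chosen only because they are widely used; the function name `classification_metrics` and the label counts are hypothetical and made up for the example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up, imbalanced labels: 95 benign samples (0) and 5 attacks (1).
y_true = [0] * 95 + [1] * 5
# Hypothetical detector: catches one of the five attacks, no false alarms.
y_pred = [0] * 95 + [1] + [0] * 4

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this made-up data the detector misses four of the five attacks, yet its accuracy stays high because benign samples dominate. Recall and F1 expose the weakness, which is one reason evaluation on imbalanced security datasets usually reports more than accuracy.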
