Con2Mix: A semi-supervised method for imbalanced tabular security data

Impact factor 0.9; JCR Q4; Computer Science, Information Systems

Journal of Computer Security; Pub Date: 2023-11-10; DOI: 10.3233/jcs-220130

Xiaodi Li, Latifur Khan, Mahmoud Zamani, Shamila Wickramasuriya, Kevin Hamlen, Bhavani Thuraisingham


Citations: 0

Abstract

Con2Mix (Contrastive Double Mixup) is a new semi-supervised learning methodology that introduces a triplet mixup data augmentation approach for finding code vulnerabilities in imbalanced, tabular security data sets. Tabular data sets in cybersecurity domains are widely known to challenge machine learning because they are heavily imbalanced (e.g., a small number of labeled attack samples buried in a sea of mostly benign, unlabeled data). Semi-supervised learning trains a model from a small subset of labeled data together with a large subset of unlabeled data. While semi-supervised methods have been well studied in image and language domains, they remain underutilized in security domains, particularly on tabular security data sets, which pose especially difficult challenges of contextual information loss and balance. Experiments applying Con2Mix to collected security data sets show promise for addressing these challenges, achieving state-of-the-art performance on two evaluated data sets compared with other methods.
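The abstract names a triplet mixup augmentation but does not say how Con2Mix selects its triplets, draws its mixing weights, or combines them with a contrastive objective. As a rough illustration of the general idea only, the sketch below blends three numeric feature rows with random convex weights, which is the generic mixup technique extended to three inputs. The function names, the Dirichlet weighting, and the assumption that all features are already numeric are illustrative choices here, not details taken from the paper.

```python
import random
from typing import List, Optional, Sequence, Tuple


def dirichlet_weights(k: int, alpha: float, rng: random.Random) -> List[float]:
    """Draw k non-negative weights that sum to 1 (symmetric Dirichlet sample)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mix_rows(rows: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Return the weighted (convex) combination of equally sized feature rows."""
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows must have the same number of features")
    return [sum(w * r[j] for w, r in zip(weights, rows)) for j in range(width)]


def triplet_mixup(
    first: Sequence[float],
    second: Sequence[float],
    third: Sequence[float],
    alpha: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float]]:
    """Blend three tabular rows into one synthetic row.

    ASSUMPTION: this is generic three-way mixup, not the Con2Mix procedure,
    whose triplet selection and weighting are not described in the abstract.
    Categorical columns would need numeric encoding before mixing.
    """
    rng = rng or random.Random()
    weights = dirichlet_weights(3, alpha, rng)
    return mix_rows([first, second, third], weights), weights


if __name__ == "__main__":
    # Three hypothetical, already-scaled feature rows.
    mixed, weights = triplet_mixup([0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                                   rng=random.Random(0))
    print("weights:", weights)
    print("mixed row:", mixed)
```

In a semi-supervised setting, rows produced this way would typically be paired with correspondingly mixed labels or pseudo-labels; how Con2Mix handles that step is not stated in the abstract.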


Source journal

Journal of Computer Security (Computer Science, Information Systems)

CiteScore

1.70

Self-citation rate

0.00%

Articles published

About the journal: The Journal of Computer Security presents research and development results of lasting significance in the theory, design, implementation, analysis, and application of secure computer systems and networks. It also provides a forum for ideas about the meaning and implications of security and privacy, particularly those with important consequences for the technical community. The Journal provides an opportunity to publish articles of greater depth and length than is possible in the proceedings of various existing conferences, while addressing an audience of researchers in computer security who can be assumed to have a more specialized background than the readership of other archival publications.