A Reinforced Active Learning Sampling for Cybersecurity NER Data Annotation

2022 OITS International Conference on Information Technology (OCIT). Pub Date: 2022-12-01. DOI: 10.1109/OCIT56763.2022.00066

Smita Srivastava, Deepa Gupta, Biswajit Paul, S. Sahoo


Citations: 0

Abstract



The vast majority of cybersecurity data is unstructured text that must be proficiently annotated before it can be used to train supervised machine learning models. The critical question is how much data, and which subset of it, should be annotated to get better model performance under budget constraints. While most Machine Learning (ML) research focuses on learning better models from annotated datasets, this paper focuses on data annotation, specifically on selecting a suitable subset, with an emphasis on Named Entity Recognition (NER) for cybersecurity. The proposed method provides an active-learning-based sampling strategy that selects a minimal yet most informative set of samples from a large pool. Further, reinforcement learning is combined with the active learning approach to automate the sampling process. Results on the auto-labelled cyber-NER dataset indicate that the cyber-NER model with Reinforced Active Learning (RAL) based sampling increases F1-score by 2-7% and reduces compute time by 90% compared to subset selection by random sampling. Further, the proposed RAL approach achieved an 80% reduction in sample size and, consequently, in annotation cost, with accuracy comparable to that of complete selection.
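The abstract does not say which informativeness criterion the authors use, so the following is only a minimal sketch of one common active learning choice: ranking unlabelled sentences by the mean entropy of the model's per-token label distributions and sending the top `budget` sentences to annotators. The function names, the toy probabilities, and the entropy criterion itself are assumptions made for illustration, not the paper's method, and the reinforcement learning component is not sketched because the abstract gives no details about it.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(token_probs: Sequence[Sequence[float]]) -> float:
    """Mean token entropy; higher means the model is less sure about the sentence."""
    if not token_probs:
        return 0.0
    return sum(token_entropy(p) for p in token_probs) / len(token_probs)


def select_for_annotation(
    pool: Sequence[Sequence[Sequence[float]]], budget: int
) -> List[int]:
    """Return indices of the `budget` most uncertain sentences in the pool."""
    scores = [sentence_uncertainty(sentence) for sentence in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical pool: 3 sentences, each a list of per-token distributions
# over 3 entity labels (values are made up for illustration).
pool = [
    [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],  # model is confident
    [[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]],  # model is very uncertain
    [[0.70, 0.20, 0.10], [0.90, 0.05, 0.05]],  # in between
]

picked = select_for_annotation(pool, budget=2)
print(picked)  # indices of the sentences to annotate first
```

In a full active learning loop, the selected sentences would be labelled, added to the training set, and the model retrained before the next round of scoring; a random sampling baseline would instead draw `budget` indices uniformly from the pool.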
