Smita Srivastava, Deepa Gupta, Biswajit Paul, S. Sahoo
2022 OITS International Conference on Information Technology (OCIT), published 2022-12-01. DOI: 10.1109/OCIT56763.2022.00066 (https://doi.org/10.1109/OCIT56763.2022.00066)
A Reinforced Active Learning Sampling for Cybersecurity NER Data Annotation
The vast majority of cybersecurity data is unstructured text that must be annotated proficiently before it can be used to train supervised machine learning models. The critical question is how much data, and which subset of it, should be annotated to get better model performance under budget constraints. While most Machine Learning (ML) research focuses on learning better models from annotated datasets, this paper focuses on data annotation itself, specifically on selecting a suitable subset, with an emphasis on Named Entity Recognition (NER) for cybersecurity. The proposed method uses an active-learning-based sampling strategy to select a minimal yet highly informative set of samples from a large pool. Reinforcement learning is then combined with the active learning approach to automate the sampling process. Results on an auto-labelled cyber-NER dataset indicate that a cyber-NER model trained with Reinforced Active Learning (RAL) based sampling improves F1-score by 2-7% and reduces compute time by 90% compared with random-sampling-based subset selection. In addition, the RAL approach reduces the sample size, and consequently the annotation cost, by 80% while achieving accuracy comparable to that of selecting the complete dataset.
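To make the idea of selecting "minimal yet most informative samples" concrete, the following is a minimal sketch of one common active learning heuristic, entropy-based uncertainty sampling. It is an illustration under stated assumptions, not the paper's method: the abstract does not name the acquisition function, so the entropy score, the function names (`token_entropy`, `sentence_uncertainty`, `select_for_annotation`), and the toy probabilities are all hypothetical. The reinforcement learning component that automates sampling is not shown.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(token_probs: Sequence[Sequence[float]]) -> float:
    """Mean token entropy of a sentence; an empty sentence scores 0.0."""
    if not token_probs:
        return 0.0
    return sum(token_entropy(p) for p in token_probs) / len(token_probs)


def select_for_annotation(
    pool: Sequence[Sequence[Sequence[float]]], budget: int
) -> List[int]:
    """Return indices of the `budget` most uncertain sentences, most uncertain first."""
    scores = [sentence_uncertainty(sentence) for sentence in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[: max(0, budget)]


# Hypothetical pool: each sentence is a list of per-token label distributions
# (e.g. over O / MALWARE / VULNERABILITY) produced by some current NER model.
pool = [
    [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],  # model is confident
    [[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]],  # model is very uncertain
    [[0.70, 0.20, 0.10]],                      # moderately uncertain
    [[0.90, 0.05, 0.05], [0.50, 0.45, 0.05]],  # mixed
    [[1.0, 0.0, 0.0]],                         # fully certain
]

# Assumed budget: annotate 20% of the pool, mirroring the 80% reduction
# in sample size reported in the abstract.
budget = max(1, round(0.2 * len(pool)))
selected = select_for_annotation(pool, budget)
print(selected)
```

In a full active learning loop, the selected sentences would be annotated, added to the training set, and the model retrained before the remaining pool is scored again. Random sampling, the baseline in the abstract, would instead draw the same number of indices without looking at the scores.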