Multi-label Classification of Cybersecurity Text with Distant Supervision
M. Ishii, K. Mori, Ryoichi Kuwana, S. Matsuura
Proceedings of the 17th International Conference on Availability, Reliability and Security, 2022-08-23
DOI: 10.1145/3538969.3543795 (https://doi.org/10.1145/3538969.3543795)
Citations: 1
Abstract
Detailed analysis of cybersecurity intelligence from diverse data sources is essential to counter increasingly advanced and complex cyberattacks and threats. In particular, sophisticated learning models are required to classify cyberattacks and threats or to extract security intelligence from unstructured natural-language data. This study addresses text classification as a first step toward such models. More specifically, we performed multi-label classification of cybersecurity documents to reduce the cost of threat analysis and incident response. Detailed analysis of security incidents requires an integrated model that performs security intelligence extraction and event extraction and leverages the relationships between these tasks. We performed document-level multi-label classification using the standard MITRE categories for cybersecurity attack and threat models. Furthermore, to reduce the cost of creating the large annotated dataset needed to improve model accuracy, we automated the generation of training data using distant supervision [18]. We compared several methods for extracting keywords from texts related to each defined classification category, as well as several rules for assigning multiple labels. We used cybersecurity documents from social news sites, threat reports, and blog articles posted by security vendors as training and test data. We trained a multi-label classification model on these texts using document-level embedding vectors obtained from a pre-trained language model. We also reported the classification results for each category and compared several models and labeling with distant supervision. In addition, we manually annotated documents sampled from the test data and evaluated classification accuracy on the annotated data. We showed that the machine learning models are slightly more accurate than rule-based classification with distant supervision on the test data. In some cases, however, distant supervision labeling is more accurate than the machine learning model on the human-annotated data. Furthermore, we analyzed and discussed the statistics of the labels assigned by distant supervision, their co-occurrence with the categories predicted by the trained model, and how to use the classification model in cybersecurity incident response.
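The keyword-based labeling and per-category evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword lists, the `min_hits` assignment rule, and the function names `distant_labels` and `per_category_f1` are all hypothetical, chosen only to show how a document can receive several labels from keyword matches and how those labels might be scored against human annotation.

```python
import re
from collections import Counter

# Hypothetical keyword lists. The category names echo MITRE ATT&CK tactics,
# but the keywords are invented for illustration, not taken from the paper.
CATEGORY_KEYWORDS = {
    "Initial Access": {"phishing", "attachment", "exploit"},
    "Persistence": {"registry", "startup", "scheduled"},
    "Exfiltration": {"exfiltrate", "upload", "leak"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def distant_labels(text, keywords=CATEGORY_KEYWORDS, min_hits=1):
    """Assign every category whose keywords occur at least min_hits times.

    min_hits is an assumed label assignment rule; a document may receive
    zero, one, or several labels.
    """
    counts = Counter(tokenize(text))
    labels = set()
    for category, words in keywords.items():
        hits = sum(counts[w] for w in words)
        if hits >= min_hits:
            labels.add(category)
    return labels


def per_category_f1(gold, predicted, categories):
    """Compute F1 for each category from parallel lists of label sets."""
    scores = {}
    for c in categories:
        tp = sum(1 for g, p in zip(gold, predicted) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, predicted) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, predicted) if c in g and c not in p)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


docs = [
    "The phishing email carried a malicious attachment that added a registry key.",
    "Attackers upload stolen files to a remote server.",
]
# Labels produced automatically by keyword matching (distant supervision).
auto_labels = [distant_labels(d) for d in docs]
# Hypothetical human annotation for the same two documents.
human_labels = [{"Initial Access"}, {"Exfiltration"}]
scores = per_category_f1(human_labels, auto_labels, CATEGORY_KEYWORDS)
```

In this toy setting the first document receives two labels at once, and raising `min_hits` to 2 keeps only the category with stronger keyword evidence. Labels generated this way could serve as training targets for a classifier over document embeddings, while the per-category scores show where keyword labeling disagrees with human annotation.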