Multi-label Classification of Cybersecurity Text with Distant Supervision

Proceedings of the 17th International Conference on Availability, Reliability and Security. Pub Date: 2022-08-23. DOI: 10.1145/3538969.3543795

M. Ishii, K. Mori, Ryoichi Kuwana, S. Matsuura


Citations: 1

Abstract

Detailed analysis of cybersecurity intelligence across diverse data sources is essential to counter increasingly advanced and complex cyberattacks and threats. In particular, sophisticated learning models are required to classify cyberattacks and threats or to extract security intelligence from unstructured natural-language data. This study addresses text classification as the first step toward such models. More specifically, we performed multi-label classification of cybersecurity documents to reduce the cost of threat analysis and incident response. Detailed analysis of security incidents requires an integrated model that performs security intelligence extraction and event extraction tasks and leverages the relationships between them. We performed document-level multi-label classification using the standard MITRE categories for cybersecurity attack and threat models. Furthermore, to reduce the cost of creating the large annotated dataset needed to improve model accuracy, we automated the generation of training data by using distant supervision [18]. We compared several methods for extracting keywords from texts related to each defined classification category, as well as several rules for assigning multiple labels. We used cybersecurity documents from social news sites, threat reports, and blog articles posted by security vendors as training and test data. We trained a multi-label classification model on these texts using document-level embedding vectors obtained from a pre-trained language model. We also reported the experimental classification results for each category and compared several models with distant supervision labeling. In addition, we performed human annotation on documents sampled from the test data and evaluated classification accuracy on the annotated data. We showed that the machine learning models are slightly more accurate than rule-based classification with distant supervision on the test data. In some cases, however, distant supervision labeling achieves higher classification accuracy than the machine learning model on the human-annotated data. Furthermore, we analyzed and discussed the statistics of the labels assigned by distant supervision, their co-occurrence with the categories predicted by the trained model, and how to use the classification model in cybersecurity incident response.
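
The keyword-based labeling step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the category names are borrowed from MITRE ATT&CK tactics (the abstract does not say which MITRE categories were used), the keyword lists are hand-picked placeholders rather than automatically extracted ones, and the `min_hits` threshold is only one hypothetical label assignment rule. The resulting multi-hot vectors are the kind of weak labels a classifier over document embeddings could then be trained on.

```python
import re

# Hypothetical keyword lists. The paper extracts keywords automatically from
# category-related texts; these hand-picked words are placeholders only.
CATEGORY_KEYWORDS = {
    "Initial Access": {"phishing", "exploit", "attachment"},
    "Credential Access": {"password", "credential", "keylogger"},
    "Exfiltration": {"exfiltrate", "upload", "leak"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def distant_labels(text, min_hits=1):
    """Assign every category sharing at least min_hits keywords with the text.

    min_hits is an assumed label assignment rule, not the paper's rule.
    """
    tokens = tokenize(text)
    return sorted(
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if len(tokens & keywords) >= min_hits
    )


def to_multi_hot(labels):
    """Encode a label list as a 0/1 vector in the fixed category order."""
    return [1 if category in labels else 0 for category in CATEGORY_KEYWORDS]


doc = "The phishing attachment installed a keylogger that stole each password."
labels = distant_labels(doc)
print(labels)                # ['Credential Access', 'Initial Access']
print(to_multi_hot(labels))  # [1, 1, 0]
```

Raising `min_hits` makes the rule stricter: it assigns fewer labels, each backed by more keyword evidence. This is one simple way to vary a label assignment rule.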

