Efficient Clustering of Software Vulnerabilities using Self Organizing Map (SOM)

Khyati Panchal, Siddhartha Shankar Das, Luis De La Torre, John Miller, R. Rallo, M. Halappanavar
2022 IEEE International Symposium on Technologies for Homeland Security (HST)
Published: 2022-11-14
DOI: 10.1109/HST56032.2022.10025443
https://doi.org/10.1109/HST56032.2022.10025443

Abstract

The Common Vulnerabilities and Exposures (CVE) database was created with a mission to “identify, define, and catalog publicly disclosed cybersecurity vulnerabilities”. This rich body of information can support rapid and efficient responses that secure and defend cyber operations and protect critical cyber infrastructure. The main goal of this paper is to develop a visual analytics tool that enables deep analysis of CVEs using unsupervised clustering techniques. We enhance our analysis by first mapping CVEs to hierarchical classes in the Common Weakness Enumeration (CWE) using information in the National Vulnerability Database (NVD). Both the mapping and the numerical representation of CVEs are provided by V2W-BERT, which applies natural language processing to the extensive information in NVD to generate a large tabular database of 137,226 CVE entries from 1999 to 2020, where each CVE is represented by a vector of 768 numerical features. The vectorized data is processed by a Self-Organizing Map (SOM), an unsupervised machine learning technique for dimensionality reduction, visual representation, and clustering. Using a torus map of 6417 units, we achieve 10-fold data compression of the roughly 140k CVEs. The trained map is further clustered with standard K-means into 138 clusters of CVEs. We conducted a brief investigation of the resulting mappings, from CVEs to best-matching units to K-means clusters, and from CVEs to CWEs. For example, this novel mapping provided insight into the role of CWE-59 and CWE-264 in several CVEs, which is otherwise hard to explore in the original data. We conclude that this novel approach will not only enable deep analysis of the complex relationships between CVEs and CWEs, but also provide a mechanism to respond quickly to, and design mitigation actions for, rapidly evolving vulnerabilities that have not been mapped to existing CWEs.
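The pipeline summarized in the abstract (feature vectors, then a torus-shaped SOM, then K-means over the trained map units, then a CVE to best-matching-unit to cluster lookup) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy 8-dimensional vectors stand in for the 768-feature V2W-BERT representations, a 3 x 3 rectangular grid stands in for the 6417-unit map, k = 3 stands in for 138 clusters, and the Gaussian neighbourhood, linear decay schedules, and all function names are hypothetical choices.

```python
import math
import random


def torus_dist2(a, b, rows, cols):
    """Squared grid distance between units a and b on a wrap-around (torus) map."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    dr = min(dr, rows - dr)  # wrap vertically
    dc = min(dc, cols - dc)  # wrap horizontally
    return dr * dr + dc * dc


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def bmu(weights, x):
    """Index of the best-matching unit: the unit whose weights are closest to x."""
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i], x))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Online SOM training with a Gaussian neighbourhood (assumed schedule)."""
    rng = random.Random(seed)
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [list(rng.choice(data)) for _ in coords]  # initialise from samples
    sigma0 = max(rows, cols) / 2
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1 - frac)                  # learning rate decays linearly
            sigma = max(sigma0 * (1 - frac), 0.5)  # neighbourhood radius shrinks
            win = coords[bmu(weights, x)]
            for w, pos in zip(weights, coords):
                h = math.exp(-torus_dist2(win, pos, rows, cols) / (2 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])  # pull unit towards the sample
            step += 1
    return weights


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def make_toy_vectors(n_per_group=30, dim=8, seed=1):
    """Synthetic stand-in data: three noisy groups (not real CVE embeddings)."""
    rng = random.Random(seed)
    data = []
    for centre in (0.0, 5.0, -5.0):
        for _ in range(n_per_group):
            data.append([centre + rng.gauss(0, 0.3) for _ in range(dim)])
    return data


data = make_toy_vectors()
rows, cols = 3, 3
weights = train_som(data, rows, cols)
unit_cluster = kmeans(weights, k=3)                      # cluster the map units
cve_to_bmu = [bmu(weights, x) for x in data]             # vector -> map unit
cve_to_cluster = [unit_cluster[u] for u in cve_to_bmu]   # vector -> unit -> cluster
print(len(data), "vectors ->", rows * cols, "units ->",
      len(set(unit_cluster)), "unit clusters")
```

Clustering the unit weight vectors rather than the raw inputs is what makes the second stage cheap: K-means only sees as many points as there are map units, and each input inherits the cluster of its best-matching unit.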