NG_MDERANK: A software vulnerability feature knowledge extraction method based on N-gram similarity

IF 1.8, Zone 4 (Computer Science), JCR Q3: COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Software-Evolution and Process, Pub Date: 2024-08-27, DOI: 10.1002/smr.2727

Xiaoxue Wu, Shiyu Weng, Bin Zheng, Wei Zheng, Xiang Chen, Xiaobin Sun

Citations: 0

Abstract

As software grows in size and complexity, the number of software vulnerabilities increases, leading to a range of serious security issues. Open-source software vulnerability reports and documentation are a convenient resource for analysis and detection. However, the quality of different data sources varies, and the data are duplicated and lack correlation, so using them often requires substantial manual management and analysis. To address the scattered, heterogeneous, and poorly correlated data in traditional vulnerability repositories, this paper proposes a software vulnerability feature knowledge extraction method that combines the N-gram model with mask similarity. The method extracts N-gram candidate keywords, generates masked text from them, and extracts vulnerability feature knowledge by calculating the similarity of the masked text. It analyzes samples efficiently and stably when the sample set is large and complex, and it yields high-value semi-structured data. The final node, relationship, and attribute information is then obtained through a second round of knowledge cleaning and extraction on the semi-structured results. Based on the extraction results, a software vulnerability domain knowledge graph is constructed to explore the semantic features and entity relationships of vulnerabilities, which can help in studying software security problems and resolving vulnerabilities efficiently. The effectiveness and superiority of the proposed method are verified by comparing it with several traditional keyword extraction algorithms on Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) vulnerability data.
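
The masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how text is embedded or how similarity is computed, so the example assumes a plain bag-of-words count vector and cosine similarity as stand-ins, and every function name, the stopword list, and the sample sentence are hypothetical. Each N-gram candidate is replaced by a [MASK] token, and candidates whose removal lowers the similarity to the original text the most are ranked first.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list used only for this sketch.
STOPWORDS = {"a", "an", "the", "in", "is", "to", "by", "of", "and", "or"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_candidates(tokens, max_n=3):
    """Return unique 1..max_n-grams that neither start nor end with a stopword."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                grams.append(gram)
    return list(dict.fromkeys(grams))  # de-duplicate, keep first-seen order


def mask(tokens, phrase):
    """Replace every occurrence of `phrase` with a single [MASK] token."""
    out, i, n = [], 0, len(phrase)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == phrase:
            out.append("[MASK]")
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def cosine(a, b):
    """Cosine similarity between two Counter-based count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def masked_similarity(tokens, phrase):
    """Similarity between the original text and the text with `phrase` masked."""
    return cosine(Counter(tokens), Counter(mask(tokens, phrase)))


def rank_keyphrases(text, max_n=3, top_k=5):
    """Rank candidates so that the lowest masked similarity comes first."""
    tokens = tokenize(text)
    scored = [(masked_similarity(tokens, p), " ".join(p))
              for p in ngram_candidates(tokens, max_n)]
    scored.sort()
    return [phrase for _, phrase in scored[:top_k]]


if __name__ == "__main__":
    sample = ("A buffer overflow in the parser allows remote attackers to "
              "execute arbitrary code. The buffer overflow is triggered by "
              "a crafted file.")
    print(rank_keyphrases(sample))
```

A count vector rewards repeated and longer phrases, which is enough to show the ranking mechanics; a contextual embedding model would be the natural replacement for `Counter` if semantic similarity is needed.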

Source journal: Journal of Software-Evolution and Process (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

Self-citation rate: 10.00%

Articles published: 109