VuldiffFinder: Discovering inconsistencies in unstructured vulnerability information

Impact factor 4.8; Zone 2, Computer Science; JCR Q1, Computer Science, Information Systems

Computers & Security; Pub Date: 2025-03-29; DOI: 10.1016/j.cose.2025.104447

Qindong Li, Wenyi Tang, Xingshu Chen, Hao Ren

Volume 154, Article 104447 (journal article). Publisher page: https://www.sciencedirect.com/science/article/pii/S0167404825001361

Citations: 0

Abstract


The information conveyed by vulnerability reports is crucial for enhancing the security of information systems. Nonetheless, information inconsistencies are widespread across reports, including numerical discrepancies, misreported version ranges, and semantic conflicts, among others. Identifying these inconsistencies is essential for improving information quality. Current research primarily addresses inconsistencies in standardized, non-free-form information at the character or numerical level, while research on unstructured information at the semantic level is limited. Given this, we introduce VuldiffFinder to determine the inconsistency of unstructured vulnerability information at the semantic level. First, it uses NLP tools to break unstructured information down into constituent sets and designs a determination strategy based on the constituents' syntactic hierarchies and semantic similarity. The designed strategy can handle information pairs of arbitrary structure. Second, it creates a span-similarity-based fine-tuning task to enhance the embedding capabilities of the SpanBERT model, so that it accurately captures semantic information in the vulnerability domain. Finally, a dataset containing eight categories of vulnerability information and 1,612 samples is used to validate the proposed method. The results demonstrate that VuldiffFinder outperforms state-of-the-art schemes, with a 4.31% improvement in F1-score. Additionally, we find that consistency is higher in information with simpler writing structures (up to 56.46%). Heterogeneous and Contained are often found in information with fixed or complex writing structures (up to 23.33% and 38.30%, respectively). Divergent and Repugnant mainly occur in information with a high missing rate.
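The first step described in the abstract (splitting each description into constituents, then comparing the two constituent sets by semantic similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the constituents are supplied already split rather than produced by an NLP parser, a bag-of-words cosine stands in for the fine-tuned SpanBERT embeddings, the 0.5 threshold is arbitrary, and the relation labels returned by the hypothetical `coarse_relation` helper are not the paper's categories.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a learned span embedding."""
    return Counter(re.findall(r"[a-z0-9]+(?:\.[0-9]+)*", text.lower()))


def cosine(u, v):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def compare_constituents(set_a, set_b, threshold=0.5):
    """Greedily pair each constituent of set_a with its most similar unused
    constituent of set_b; pairs below the (assumed) threshold stay unmatched."""
    vecs_b = [bag_of_words(s) for s in set_b]
    unused = set(range(len(set_b)))
    matched, only_a = [], []
    for a in set_a:
        va = bag_of_words(a)
        best = max(unused, key=lambda j: cosine(va, vecs_b[j]), default=None)
        if best is not None and cosine(va, vecs_b[best]) >= threshold:
            matched.append((a, set_b[best]))
            unused.discard(best)
        else:
            only_a.append(a)
    only_b = [set_b[j] for j in sorted(unused)]
    return matched, only_a, only_b


def coarse_relation(matched, only_a, only_b):
    """Hypothetical labels for illustration; not the paper's taxonomy."""
    if not matched:
        return "disjoint"
    if not only_a and not only_b:
        return "equivalent"
    if not only_a:
        return "a_within_b"
    if not only_b:
        return "b_within_a"
    return "overlapping"


report_a = ["remote attacker", "execute arbitrary code", "crafted packet"]
report_b = ["execute arbitrary code", "remote attacker"]
print(coarse_relation(*compare_constituents(report_a, report_b)))  # prints: b_within_a
```

Because matching operates on sets rather than on aligned fields, the two descriptions do not need to share a sentence structure. Swapping `bag_of_words` and `cosine` for a real embedding model would change only those two functions.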


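Source journal

Computers & Security (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 12.40
Self-citation rate: 7.10%
Articles published: 365
Review time: 10.7 months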

About the journal: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors: industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.