Bayes-based word weighting for enhanced vulnerability classification in critical infrastructure systems

Impact factor 4.8 · Zone 2 (Computer Science) · JCR Q1, Computer Science, Information Systems

Computers & Security Pub Date : 2025-03-25 DOI:10.1016/j.cose.2025.104451

Aissa Ben Yahya, Hicham El Akhal, El Mehdi Ismaili Alaoui, Abdelbaki El Belrhiti El Alaoui

Computers & Security, volume 154, Article 104451. Full text: https://www.sciencedirect.com/science/article/pii/S0167404825001403


Abstract

The increasing number of vulnerabilities in embedded devices poses a significant threat to the security of the critical infrastructure where these devices are used. While deep learning approaches have advanced software vulnerability classification, they exhibit critical limitations regarding word weighting. Conventional methods like term frequency–inverse document frequency (TF–IDF) prioritize global term distributions but overlook intra-class distinctions. While improved variants of this technique have been proposed, they often fail to consider that a word’s importance can vary across categories, and they struggle to prioritize rare but distinctive words adequately. Additionally, high inter-class semantic overlap and terminological ambiguity in vulnerability descriptions hinder model performance, because models fail to separate intra-class keywords from background noise. To address these gaps, we propose a novel vulnerability classification and word vector weighting approach based on Bayes' theorem. Our method dynamically adjusts term relevance by calculating posterior probabilities of word-category associations, emphasizing rare tokens with high intra-class specificity. We validate the approach on four test datasets derived from databases such as the National Vulnerability Database (NVD) and the Chinese vulnerability database (CNNVD). Rigorous ablation and comparative studies demonstrate that Bayes-based word weighting outperforms other methods, achieving 97.63% accuracy and a 97.60% F1-score on the most challenging test data. All our models and the code to reproduce our results are open-sourced.
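The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea it describes: scoring each word by the posterior probability of a category given that word, P(c | w) = P(w | c) P(c) / Σ_k P(w | c_k) P(c_k). The function name `posterior_word_weights`, the document-level counting, the Laplace smoothing parameter `alpha`, and the toy corpus are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def posterior_word_weights(docs, labels, alpha=1.0):
    """Return {category: {word: P(category | word)}} for tokenized documents.

    Hypothetical helper: document-level presence counts with Laplace
    smoothing (alpha) are an assumption, not taken from the paper.
    """
    classes = sorted(set(labels))
    class_docs = Counter(labels)            # number of documents per category
    word_class_docs = defaultdict(Counter)  # category -> word -> documents containing it
    for doc, label in zip(docs, labels):
        for word in set(doc):
            word_class_docs[label][word] += 1

    vocab = {word for doc in docs for word in doc}
    n_docs = len(docs)
    weights = {c: {} for c in classes}
    for word in vocab:
        joint = {}
        for c in classes:
            # Smoothed likelihood P(word present | c) times prior P(c).
            likelihood = (word_class_docs[c][word] + alpha) / (class_docs[c] + 2 * alpha)
            prior = class_docs[c] / n_docs
            joint[c] = likelihood * prior
        evidence = sum(joint.values())      # P(word), summed over categories
        for c in classes:
            weights[c][word] = joint[c] / evidence
    return weights


# Toy corpus (invented for illustration only).
docs = [
    ["buffer", "overflow", "in", "parser"],
    ["heap", "overflow", "in", "decoder"],
    ["sql", "injection", "in", "login"],
    ["blind", "sql", "injection", "in", "search"],
]
labels = ["memory", "memory", "injection", "injection"]

weights = posterior_word_weights(docs, labels)
print(weights["memory"]["overflow"])  # category-specific word: high posterior
print(weights["memory"]["in"])        # background word: uninformative posterior
```

In this sketch a word that appears in every category, such as "in", receives the same posterior for each category, while a word concentrated in one category receives a high posterior there even if it is rare overall. One plausible use, again an assumption, is to multiply each word's embedding by its posterior weight before feeding it to a classifier.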


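Source journal

Computers & Security (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 12.40

Self-citation rate: 7.10%

Articles published: 365

Review time: 10.7 months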

Journal description: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors: industry, commerce and academia. Recognized worldwide as the primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.