Explaining poor performance of text-based machine learning models for vulnerability detection

IF 3.6; Zone 2, Computer Science; Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering Pub Date : 2024-07-22 DOI:10.1007/s10664-024-10519-8

Kollin Napier, Tanmay Bhowmik, Zhiqian Chen

Citations: 0

Abstract

As software vulnerabilities grow more severe, machine learning models are increasingly adopted to combat the threat, and research in this area has introduced a variety of approaches. Although these models differ in performance, there is an overall lack of explainability in how a model learns and predicts. Furthermore, recent research suggests that models that interpret source code as text, known as “text-based” models, perform poorly at detecting vulnerabilities. To help explain this poor performance, we explore the dimensions of explainability. Building on recent studies of text-based models, we experiment with removing features that overlap between the training and testing datasets, which we deem “cross-cutting”. We conduct scenario experiments that remove such “cross-cutting” data and then reassess model performance, and we use the results to examine how removing these features affects the models. Our results show that removing “cross-cutting” features may improve model performance in general, which points to explainable dimensions regarding data dependency and agnostic models. Overall, we conclude that model performance can be improved and that explainable aspects of such models can be identified through empirical analysis of their performance.
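
The removal step described in the abstract can be pictured with a small sketch. The abstract does not say how features are represented or how overlap is measured, so everything below is an assumption made for illustration: features are taken to be individual source-code tokens, a token counts as "cross-cutting" if it occurs in both splits, and the helper names (`vocabulary`, `cross_cutting_tokens`, `remove_tokens`, `strip_cross_cutting`) and the toy data are hypothetical. It is not the authors' implementation.

```python
from typing import List, Set, Tuple

Sample = List[str]  # one code snippet, already split into tokens (assumed representation)


def vocabulary(samples: List[Sample]) -> Set[str]:
    """Collect every distinct token that occurs in a list of samples."""
    return {token for sample in samples for token in sample}


def cross_cutting_tokens(train: List[Sample], test: List[Sample]) -> Set[str]:
    """Tokens present in both splits: our assumed reading of "cross-cutting"."""
    return vocabulary(train) & vocabulary(test)


def remove_tokens(samples: List[Sample], drop: Set[str]) -> List[Sample]:
    """Return copies of the samples with every token in `drop` filtered out."""
    return [[token for token in sample if token not in drop] for sample in samples]


def strip_cross_cutting(
    train: List[Sample], test: List[Sample]
) -> Tuple[List[Sample], List[Sample], Set[str]]:
    """Remove the shared tokens from both splits and report which were removed."""
    shared = cross_cutting_tokens(train, test)
    return remove_tokens(train, shared), remove_tokens(test, shared), shared


# Toy data invented for this sketch; it does not come from the paper.
train = [["strcpy", "(", "buf", ",", "src", ")"], ["free", "(", "ptr", ")"]]
test = [["memcpy", "(", "buf", ",", "src", ",", "n", ")"]]

train_filtered, test_filtered, shared = strip_cross_cutting(train, test)
print(sorted(shared))
print(train_filtered)
print(test_filtered)
```

In an experiment of the kind the abstract describes, a model would then be retrained on the filtered training split and reassessed on the filtered testing split, so that its scores before and after removal can be compared. That step depends on the chosen model and is left out here.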

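Source journal

Empirical Software Engineering (Engineering & Technology: Computer Science, Software Engineering)

CiteScore: 8.50

Self-citation rate: 12.20%

Articles published: 169

Review time: >12 weeks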

About the journal: Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies (processes, methods, or tools) and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research to address the significant gap between research and practice.