Evaluation and Improvement of Test Selection for Large Language Models

Lili Quan, Jin Wen, Qiang Hu, Maxime Cordy, Yuheng Huang, Lei Ma, Xiaohong Li

Journal of Software: Evolution and Process, Vol. 37, No. 10, published 2025-10-08. DOI: 10.1002/smr.70057
URL: https://onlinelibrary.wiley.com/doi/10.1002/smr.70057
Large language models (LLMs) have recently achieved significant success across various application domains, garnering substantial attention from different communities. Unfortunately, many faults remain, that is, inputs that LLMs cannot predict correctly. Such faults harm the usability of LLMs in general and could introduce safety issues in reliability-critical systems such as autonomous driving systems. Quickly revealing these faults in the real-world data that LLMs face is important but challenging, mainly because ground-truth labels are required and data labeling is expensive in both time and human effort. To address this problem, the conventional deep learning testing field has proposed test selection methods that evaluate deep learning models efficiently by prioritizing likely faults. However, despite their importance, the usefulness of these methods for LLMs is unclear and underexplored. In this paper, we conduct the first empirical study investigating the effectiveness of existing test selection methods on LLMs. We focus on classification tasks because most existing test selection methods target this setting and because reliably estimating confidence scores for the variable-length outputs of generative tasks remains challenging. Experimental results on four different tasks (covering both code and natural language processing tasks) and four LLMs (e.g., LLaMA3 and GPT-4) demonstrate that simple methods such as Margin perform well on LLMs, but there is still substantial room for improvement. Based on this study, we further propose MuCS, a prompt Mutation-based prediction Confidence Smoothing framework that boosts the test selection capability for LLMs on classification tasks. Concretely, MuCS applies multiple prompt mutation techniques to collect diverse outputs for confidence smoothing. The results show that our framework significantly enhances existing methods, improving test relative coverage by up to 70.53%.
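To make the two ingredients of the abstract concrete, the sketch below illustrates (i) the Margin uncertainty metric commonly used to prioritize test inputs and (ii) confidence smoothing by averaging class probabilities over mutated prompts, which is the core idea behind MuCS as described above. This is a minimal illustration under our own assumptions, not the authors' implementation; `get_class_probabilities` and `mutate_prompt` are hypothetical callables standing in for an LLM classification query and a prompt mutation operator.

```python
# Minimal sketch of margin-based test prioritization with prompt-mutation
# confidence smoothing. Illustrative only: `get_class_probabilities` and
# `mutate_prompt` are hypothetical placeholders supplied by the caller.
from typing import Callable, List, Sequence


def margin(probs: Sequence[float]) -> float:
    """Margin score: gap between the top-2 class probabilities.

    A small margin means the model is uncertain about the input, so the
    input is more likely to be a fault and should be labeled first.
    """
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - (top2[1] if len(top2) > 1 else 0.0)


def smoothed_probs(
    prompt: str,
    get_class_probabilities: Callable[[str], List[float]],  # hypothetical LLM call
    mutate_prompt: Callable[[str], List[str]],               # hypothetical mutator
) -> List[float]:
    """Average class probabilities over the original prompt and its mutants."""
    variants = [prompt] + mutate_prompt(prompt)
    prob_lists = [get_class_probabilities(v) for v in variants]
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / len(prob_lists) for c in range(n_classes)]


def prioritize(
    prompts: List[str],
    get_class_probabilities: Callable[[str], List[float]],
    mutate_prompt: Callable[[str], List[str]],
) -> List[str]:
    """Rank test inputs so the most uncertain (smallest margin) come first."""
    scored = [
        (margin(smoothed_probs(p, get_class_probabilities, mutate_prompt)), p)
        for p in prompts
    ]
    return [p for _, p in sorted(scored, key=lambda s: s[0])]
```

In this reading, selection metrics such as Margin are applied unchanged; the smoothing step only replaces the raw confidence vector with one averaged over prompt mutants before ranking, which is how the abstract describes MuCS boosting existing test selection methods.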