Evaluation and Improvement of Test Selection for Large Language Models
Lili Quan, Jin Wen, Qiang Hu, Maxime Cordy, Yuheng Huang, Lei Ma, Xiaohong Li
Journal of Software-Evolution and Process, 37(10). Published 2025-10-08. DOI: 10.1002/smr.70057
https://onlinelibrary.wiley.com/doi/10.1002/smr.70057
Citation count: 0
Abstract
Large language models (LLMs) have recently achieved significant success across various application domains, attracting substantial attention from different communities. Unfortunately, many faults, i.e., inputs that LLMs cannot predict correctly, still exist. Such faults harm the usability of LLMs in general and could introduce safety issues in reliability-critical systems such as autonomous driving. Quickly revealing these faults in the real-world datasets that LLMs may face is important but challenging, mainly because ground-truth labels are required and data labeling is costly in time and human effort. To address this problem, the conventional deep learning testing field has proposed test selection methods, which evaluate deep learning models efficiently by prioritizing faults. However, despite their importance, the usefulness of these methods for LLMs remains unclear and underexplored. In this paper, we conduct the first empirical study of the effectiveness of existing test selection methods for LLMs. We focus on classification tasks because most existing test selection methods target this setting, and because reliably estimating confidence scores for variable-length outputs in generative tasks remains challenging. Experimental results on four tasks (covering both code tasks and natural language processing tasks) and four LLMs (e.g., LLaMA3 and GPT-4) demonstrate that simple methods such as Margin perform well on LLMs, but that there is still considerable room for improvement. Based on this study, we further propose MuCS, a prompt Mutation-based prediction Confidence Smoothing framework that boosts the test selection capability for LLMs, specifically on classification tasks. Concretely, we propose multiple prompt mutation techniques that help collect diverse outputs for confidence smoothing. The results show that the proposed framework significantly enhances existing methods, improving test relative coverage by up to 70.53%.
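
To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that Margin ranks inputs by the gap between the two largest class probabilities (smaller gap means higher priority), and that confidence smoothing can be approximated by averaging the class probabilities obtained from the original prompt and its mutated variants. The averaging rule, the function names, and the sample probabilities are all hypothetical; the abstract does not specify how MuCS aggregates outputs.

```python
from typing import List, Sequence


def margin_score(probs: Sequence[float]) -> float:
    """Gap between the two largest class probabilities (small gap = uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def smooth_confidence(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities across outputs from several prompt variants.

    Plain averaging is an assumption made for this sketch; the abstract does
    not state how the outputs are combined.
    """
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def select_tests(probs_per_input: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return the indices of the `budget` inputs with the smallest margin."""
    order = sorted(
        range(len(probs_per_input)),
        key=lambda i: margin_score(probs_per_input[i]),
    )
    return order[:budget]


# Hypothetical data: three inputs of a binary classification task. Each input
# has class probabilities from the original prompt and two mutated prompts.
outputs = [
    [[0.90, 0.10], [0.85, 0.15], [0.95, 0.05]],  # confident under every prompt
    [[0.80, 0.20], [0.40, 0.60], [0.55, 0.45]],  # prediction flips under mutation
    [[0.60, 0.40], [0.65, 0.35], [0.70, 0.30]],  # mildly uncertain throughout
]

# Ranking with the original prompt only (first vector of each input).
baseline = select_tests([variants[0] for variants in outputs], budget=2)

# Ranking after smoothing over all prompt variants.
smoothed = [smooth_confidence(variants) for variants in outputs]
selected = select_tests(smoothed, budget=2)

print(baseline)  # prints [2, 1]
print(selected)  # prints [1, 2]
```

In this toy data, the second input looks fairly confident under the original prompt alone, but its prediction is unstable across mutated prompts, so smoothing moves it to the top of the labeling queue. Whether this effect holds on real LLM outputs is what the paper's experiments evaluate.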