{"title":"Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction","authors":"Sungmin Kang;Juyeon Yoon;Nargiz Askarbekkyzy;Shin Yoo","doi":"10.1109/TSE.2024.3450837","DOIUrl":null,"url":null,"abstract":"Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often in natural language and thus can be difficult to transform to test cases consistently. As a result, existing techniques mostly focused on crash bugs, which are easier to automatically detect and verify. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests, and via a post-processing pipeline to automatically identify promising generated tests, our proposed technique \n<sc>Libro</small>\n could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation on 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential, with the StarCoder LLM achieving 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of performance on a held-out bug dataset likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using \n<sc>Libro</small>\n improves as LLM size increases, providing information as to which LLMs can be used with the \n<sc>Libro</small>\n pipeline.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 10","pages":"2677-2694"},"PeriodicalIF":6.5000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10664637/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citation count: 0
Abstract
Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often written in natural language and are thus difficult to transform into test cases consistently. As a result, existing techniques have mostly focused on crash bugs, which are easier to detect and verify automatically. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests, and via a post-processing pipeline that automatically identifies promising generated tests, our proposed technique Libro could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation of 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential: the StarCoder LLM achieves 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of its performance on a held-out bug dataset that is likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using Libro improves as LLM size increases, providing guidance as to which LLMs can be used with the Libro pipeline.
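
As a rough illustration of the two-stage pipeline the abstract describes (LLM-based test generation followed by post-processing to identify promising tests), the sketch below shows one plausible shape such a pipeline could take. This is a minimal sketch under stated assumptions, not Libro's actual implementation: all names here (`BugReport`, `build_prompt`, `query_llm`, `run_test`, `reproduce`) are hypothetical, and the `query_llm` and `run_test` callables stand in for an LLM client and a test-execution harness that the paper does not specify.

```python
# Hypothetical sketch of a Libro-style bug-reproduction pipeline.
# None of these names come from the paper; they are illustrative only.
from dataclasses import dataclass


@dataclass
class BugReport:
    """A natural-language bug report, as found in an issue tracker."""
    title: str
    description: str


def build_prompt(report: BugReport) -> str:
    """Turn a bug report into a test-generation prompt for the LLM."""
    return (
        "# Bug report\n"
        f"## {report.title}\n"
        f"{report.description}\n\n"
        "# Write a JUnit test that reproduces this bug.\n"
    )


def reproduce(report: BugReport, query_llm, run_test, n_samples: int = 10):
    """Sample candidate tests from the LLM, then post-process them.

    A candidate is kept only if it runs and fails on the buggy program
    version, i.e. it plausibly reproduces the reported behavior.
    `run_test` is assumed to return a status string such as "fail".
    """
    prompt = build_prompt(report)
    candidates = [query_llm(prompt) for _ in range(n_samples)]
    promising = []
    for test_src in candidates:
        if run_test(test_src) == "fail":  # failing on buggy code = promising
            promising.append(test_src)
    # A real pipeline would additionally rank or cluster the promising
    # tests before presenting them to developers; we just return them.
    return promising
```

One design point this sketch tries to capture: sampling several candidate tests and filtering them by execution outcome is what lets the pipeline tolerate individually unreliable LLM outputs, since only tests that actually fail on the buggy version are surfaced.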
Journal Description
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.