Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction

IF 6.5 1区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING
Sungmin Kang;Juyeon Yoon;Nargiz Askarbekkyzy;Shin Yoo
{"title":"Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction","authors":"Sungmin Kang;Juyeon Yoon;Nargiz Askarbekkyzy;Shin Yoo","doi":"10.1109/TSE.2024.3450837","DOIUrl":null,"url":null,"abstract":"Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often in natural language and thus can be difficult to transform to test cases consistently. As a result, existing techniques mostly focused on crash bugs, which are easier to automatically detect and verify. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests, and via a post-processing pipeline to automatically identify promising generated tests, our proposed technique \n<sc>Libro</small>\n could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation on 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential, with the StarCoder LLM achieving 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of performance on a held-out bug dataset likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using \n<sc>Libro</small>\n improves as LLM size increases, providing information as to which LLMs can be used with the \n<sc>Libro</small>\n pipeline.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 10","pages":"2677-2694"},"PeriodicalIF":6.5000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10664637/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0

Abstract

Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often in natural language and thus can be difficult to transform to test cases consistently. As a result, existing techniques mostly focused on crash bugs, which are easier to automatically detect and verify. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests, and via a post-processing pipeline to automatically identify promising generated tests, our proposed technique Libro could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation on 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential, with the StarCoder LLM achieving 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of performance on a held-out bug dataset likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using Libro improves as LLM size increases, providing information as to which LLMs can be used with the Libro pipeline.
评估用于自动和通用错误复制的各种大型语言模型
Bug 重现是一项关键的开发人员活动,也是一项具有挑战性的自动化活动,因为 Bug 报告通常使用自然语言,因此很难一致地转换为测试用例。因此,现有技术大多集中在崩溃错误上,而崩溃错误更容易自动检测和验证。在这项工作中,我们通过使用大型语言模型(LLM)克服了这一局限性,大型语言模型已被证明擅长自然语言处理和代码生成。通过提示 LLM 生成会产生错误的测试,并通过后处理管道自动识别有希望生成的测试,我们提出的 Libro 技术可以成功地重现广泛使用的 Defects4J 基准中约三分之一的错误。此外,我们对包括 11 个开源 LLM 在内的 15 个 LLM 进行了广泛评估,结果表明开源 LLM 也展现出了巨大的潜力,其中 StarCoder LLM 在大型 Defects4J 基准上的重现性能达到了闭源 OpenAI LLM code-davinci-002 的 70%,在可能不属于任何 LLM 训练数据的保留错误数据集上的重现性能达到了 90%。此外,我们在不同规模的 LLM 上进行的实验表明,随着 LLM 规模的增加,使用 Libro 的错误再现能力也会提高,这为哪些 LLM 可以与 Libro 管道一起使用提供了信息。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
IEEE Transactions on Software Engineering
IEEE Transactions on Software Engineering 工程技术-工程:电子与电气
CiteScore
9.70
自引率
10.80%
发文量
724
审稿时长
6 months
期刊介绍: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include: a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models. b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects. c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards. d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues. e) System issues: Hardware-software trade-offs. f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信