Can GPT-O1 Kill All Bugs?
Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, Quanjun Zhang
arXiv - CS - Software Engineering, 2024-09-16 (arxiv-2409.10033)
Abstract
ChatGPT has been shown to be effective in automatic program repair
(APR). With successive iterations and upgrades of ChatGPT, its
repair performance has reached state-of-the-art levels.
However, few works compare the effectiveness of different versions of
ChatGPT on APR or how they vary. In this work, we evaluate the performance
of the latest versions of ChatGPT (O1-preview and O1-mini), ChatGPT-4o, and
historical versions of ChatGPT on APR. We study the improvements of the O1 model
over traditional ChatGPT on APR from multiple perspectives (repair
success rate, repair cost, and behavior patterns), and find that O1's repair
capability exceeds that of traditional ChatGPT: O1 successfully fixes all 40 bugs
in the benchmark. Our work can serve as a reference for further in-depth
exploration of the applications of ChatGPT in APR.