{"title":"No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT","authors":"Zhijie Liu;Yutian Tang;Xiapu Luo;Yuming Zhou;Liang Feng Zhang","doi":"10.1109/TSE.2024.3392499","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks, such as machine translation, question answering, summarization, and so on. Additionally, LLMs are also highly valuable in supporting software engineering tasks, particularly in the field of code generation. Automatic code generation is a process of automatically generating source code or executable code based on given specifications or requirements, improving developer productivity. In this study, we perform a systematic empirical assessment to the quality of code generation using \n<i>ChatGPT</i>\n, a recent state-of-the-art product LLM. We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation encompasses a comprehensive analysis of code snippets generated by \n<i>ChatGPT</i>\n, focusing on three critical aspects: correctness, complexity, and security. We also specifically investigate \n<i>ChatGPT</i>\n's ability to engage in multi-round fixing process (i.e., \n<i>ChatGPT</i>\n's dialog ability, chatting between users and \n<i>ChatGPT</i>\n for fixing generated buggy code) of facilitating code generation. By delving into the generated code and examining the experimental results, this work provides valuable insights into the performance of \n<i>ChatGPT</i>\n in tackling code generation tasks over the three critical aspects. The experimental results demonstrate that (1) \n<i>ChatGPT</i>\n is better at generating functionally correct code for problems before 2021 in different languages than problems after 2021 with \n<inline-formula><tex-math>$48.14\\%$</tex-math></inline-formula>\n advantage in \n<i>Accepted</i>\n rate on judgment platform, but \n<i>ChatGPT</i>\n's ability to directly fix erroneous code with multi-round fixing process to achieve correct functionality is relatively weak; (2) the distribution of cyclomatic and cognitive complexity levels for code snippets in different languages varies. Furthermore, the multi-round fixing process with \n<i>ChatGPT </i>\n generally preserves or increases the complexity levels of code snippets; (3) in algorithm scenarios with languages of C, C++, and Java, and CWE scenarios with languages of C and Python3, the code generated by \n<i>ChatGPT </i>\n has relevant vulnerabilities. However, the multi-round fixing process for vulnerable code snippets demonstrates promising results, with more than \n<inline-formula><tex-math>$89\\%$</tex-math></inline-formula>\n of vulnerabilities successfully addressed; and (4) code generation may be affected by \n<i>ChatGPT</i>\n's non-determinism factor, resulting in variations of code snippets in functional correctness, complexity, and security. 
Overall, our findings uncover potential issues and limitations that arise in the \n<i>ChatGPT</i>\n-based code generation and lay the groundwork for improving AI and LLM-based code generation techniques.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 6","pages":"1548-1584"},"PeriodicalIF":6.5000,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10507163/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks, such as machine translation, question answering, and summarization. LLMs are also highly valuable for supporting software engineering tasks, particularly code generation. Automatic code generation is the process of producing source code or executable code from given specifications or requirements, improving developer productivity. In this study, we perform a systematic empirical assessment of the quality of code generated by ChatGPT, a recent state-of-the-art LLM product. We use 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation comprises a comprehensive analysis of the code snippets generated by ChatGPT, focusing on three critical aspects: correctness, complexity, and security. We also specifically investigate ChatGPT's ability to engage in a multi-round fixing process (i.e., its dialog ability, in which users chat with ChatGPT to fix the buggy code it generates) to facilitate code generation. By examining the generated code and the experimental results, this work provides insight into ChatGPT's performance on code generation tasks across these three aspects.
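To make the multi-round fixing process concrete, the following is a minimal sketch of such an evaluation loop, not the paper's actual harness. The helpers ask_chatgpt and run_tests are hypothetical placeholders: the first returns a code string from the ongoing conversation, and the second runs the candidate code against the problem's test cases and returns a pass flag plus an error report.

# Minimal sketch of a multi-round fixing loop (illustrative only).
# `ask_chatgpt(messages)` and `run_tests(code, tests)` are hypothetical
# placeholders, not the paper's tooling.

MAX_ROUNDS = 5  # assumed cap on the number of fixing rounds

def generate_and_fix(problem_statement, tests, ask_chatgpt, run_tests):
    """Generate code for a problem, then iteratively feed failures back."""
    messages = [{"role": "user",
                 "content": f"Write a Python solution for:\n{problem_statement}"}]
    code = ask_chatgpt(messages)

    for round_no in range(MAX_ROUNDS):
        passed, error_report = run_tests(code, tests)
        if passed:
            return code, round_no  # functionally correct after round_no fixes
        # Multi-round fixing: return the failure evidence to the model
        # and ask for a corrected version within the same conversation.
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user",
                         "content": f"The code fails with:\n{error_report}\n"
                                    "Please fix it and return only the code."})
        code = ask_chatgpt(messages)

    return code, MAX_ROUNDS  # still failing after the fixing budget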
The experimental results demonstrate that (1) ChatGPT is better at generating functionally correct code for problems released before 2021 than for problems released after 2021 across languages, with a 48.14% advantage in Accepted rate on the judging platform, but its ability to directly fix erroneous code through the multi-round fixing process to achieve correct functionality is relatively weak; (2) the distribution of cyclomatic and cognitive complexity levels of the code snippets varies across languages, and the multi-round fixing process with ChatGPT generally preserves or increases the complexity of the snippets; (3) in algorithm scenarios with C, C++, and Java, and in CWE scenarios with C and Python3, the code generated by ChatGPT contains relevant vulnerabilities; however, the multi-round fixing process for vulnerable code snippets shows promising results, with more than 89% of the vulnerabilities successfully addressed; and (4) code generation may be affected by ChatGPT's non-determinism, resulting in variations among code snippets in functional correctness, complexity, and security. Overall, our findings uncover potential issues and limitations of ChatGPT-based code generation and lay the groundwork for improving AI- and LLM-based code generation techniques.
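As a concrete illustration of the complexity measurements mentioned above, the cyclomatic complexity of Python snippets can be computed with the open-source radon package. The abstract does not state which tool the study used, so treat this as an illustrative sketch rather than the study's actual setup; the snippet being measured is an arbitrary example.

# Measuring cyclomatic complexity of a generated Python snippet with
# the radon package (pip install radon). Illustrative sketch only.
from radon.complexity import cc_visit, cc_rank

snippet = '''
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []
'''

for block in cc_visit(snippet):
    # Each block is a function or class with a cyclomatic complexity score
    # and a letter rank (A = simplest) derived from that score.
    print(block.name, block.complexity, cc_rank(block.complexity))

Cognitive complexity, the second metric in the study, is a separate measure defined by SonarSource and is typically reported by tools such as SonarQube rather than radon.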
Journal Introduction:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.