Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

arXiv - CS - Software Engineering Pub Date : 2024-09-06 DOI:arxiv-2409.04164

Luis Mayer, Christian Heumann, Matthias Aßenmacher



Abstract

In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. In this work, we evaluate five state-of-the-art LLMs (Bard, BingChat, ChatGPT, Llama2, and Code Llama) on their capabilities for text-to-code generation. In an empirical study, we feed the models prompts containing textual descriptions of coding problems sourced from the programming website LeetCode and task them with creating solutions in Python. The quality of the generated outputs is then assessed using LeetCode's testing functionality. The results indicate large differences in performance between the investigated models. ChatGPT handles these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime and memory usage of the generated outputs and compare them to the other code submissions on LeetCode. A detailed error analysis, which compares the models with respect to correct indentation and form of the generated code and assigns the incorrectly solved tasks to error categories, gives a more nuanced picture of the results and of the potential for improvement. The results also show a clear pattern: the generated code becomes increasingly incorrect as the models face more context in the form of longer prompts.
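The kind of tabulation the abstract describes (acceptance per model, and acceptance as a function of prompt length) can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the record fields `model`, `prompt_chars`, and `accepted`, the helper names, the length threshold, and the sample records are all assumptions made for the example, and the placeholder records are not results from the study.

```python
from collections import defaultdict


def pass_rates(records):
    """Return {model: fraction of tasks whose generated solution was accepted}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in records:
        totals[r["model"]] += 1
        passed[r["model"]] += int(r["accepted"])
    return {m: passed[m] / totals[m] for m in totals}


def pass_rate_by_prompt_length(records, threshold):
    """Split records into short and long prompts; return the pass rate of each group."""
    buckets = {"short": [0, 0], "long": [0, 0]}  # [accepted, total]
    for r in records:
        key = "short" if r["prompt_chars"] <= threshold else "long"
        buckets[key][0] += int(r["accepted"])
        buckets[key][1] += 1
    # None marks an empty bucket, so it is not confused with a 0% pass rate.
    return {k: (a / n if n else None) for k, (a, n) in buckets.items()}


# Placeholder records for illustration only; hypothetical, not data from the paper.
records = [
    {"model": "model_a", "prompt_chars": 400, "accepted": True},
    {"model": "model_a", "prompt_chars": 1800, "accepted": False},
    {"model": "model_b", "prompt_chars": 400, "accepted": False},
    {"model": "model_b", "prompt_chars": 1800, "accepted": False},
]

print(pass_rates(records))
print(pass_rate_by_prompt_length(records, threshold=1000))
```

In practice, the `accepted` flag would come from whatever verdict the judging platform returns for each submission, and a finer binning of prompt lengths would likely be more informative than a single threshold.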

