Retrieval-augmented code completion for local projects using large language models

IF 7.5 · CAS Tier 1 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Marko Hostnik, Marko Robnik-Šikonja
{"title":"使用大型语言模型的本地项目的检索增强代码完成","authors":"Marko Hostnik ,&nbsp;Marko Robnik-Šikonja","doi":"10.1016/j.eswa.2025.128596","DOIUrl":null,"url":null,"abstract":"<div><div>The use of large language models (LLMs) is becoming increasingly widespread among software developers. However, privacy and computational requirements are problematic with commercial solutions and the use of LLMs. In this work, we focus on using relatively small and efficient LLMs with 160M parameters that are suitable for local execution and augmentation with retrieval from local projects. We train two open transformer-based models, the generative GPT-2 and the retrieval-adapted RETRO, on open-source Python files, and empirically compare them, confirming the benefits of embedding-based retrieval. Furthermore, we improve our models’ performance with In-context retrieval-augmented generation (RAG), which retrieves code snippets using the Jaccard similarity of tokens. We evaluate In-context RAG on larger models and determine that, despite its simplicity, the approach is more suitable than using the RETRO architecture. Experimental results indicate that In-context RAG improves the code completion baseline by over 26 %, while RETRO improves over the similarly sized GPT-2 baseline by 12 %. We highlight the key role of proper tokenization in achieving the full potential of LLMs in code completion.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"292 ","pages":"Article 128596"},"PeriodicalIF":7.5000,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Retrieval-augmented code completion for local projects using large language models\",\"authors\":\"Marko Hostnik ,&nbsp;Marko Robnik-Šikonja\",\"doi\":\"10.1016/j.eswa.2025.128596\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The use of large language models (LLMs) is becoming increasingly widespread among software developers. However, privacy and computational requirements are problematic with commercial solutions and the use of LLMs. In this work, we focus on using relatively small and efficient LLMs with 160M parameters that are suitable for local execution and augmentation with retrieval from local projects. We train two open transformer-based models, the generative GPT-2 and the retrieval-adapted RETRO, on open-source Python files, and empirically compare them, confirming the benefits of embedding-based retrieval. Furthermore, we improve our models’ performance with In-context retrieval-augmented generation (RAG), which retrieves code snippets using the Jaccard similarity of tokens. We evaluate In-context RAG on larger models and determine that, despite its simplicity, the approach is more suitable than using the RETRO architecture. Experimental results indicate that In-context RAG improves the code completion baseline by over 26 %, while RETRO improves over the similarly sized GPT-2 baseline by 12 %. 
We highlight the key role of proper tokenization in achieving the full potential of LLMs in code completion.</div></div>\",\"PeriodicalId\":50461,\"journal\":{\"name\":\"Expert Systems with Applications\",\"volume\":\"292 \",\"pages\":\"Article 128596\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-06-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Systems with Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0957417425022158\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425022158","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The use of large language models (LLMs) is becoming increasingly widespread among software developers. However, privacy and computational requirements are problematic with commercial solutions and the use of LLMs. In this work, we focus on using relatively small and efficient LLMs with 160M parameters that are suitable for local execution and augmentation with retrieval from local projects. We train two open transformer-based models, the generative GPT-2 and the retrieval-adapted RETRO, on open-source Python files, and empirically compare them, confirming the benefits of embedding-based retrieval. Furthermore, we improve our models’ performance with In-context retrieval-augmented generation (RAG), which retrieves code snippets using the Jaccard similarity of tokens. We evaluate In-context RAG on larger models and determine that, despite its simplicity, the approach is more suitable than using the RETRO architecture. Experimental results indicate that In-context RAG improves the code completion baseline by over 26 %, while RETRO improves over the similarly sized GPT-2 baseline by 12 %. We highlight the key role of proper tokenization in achieving the full potential of LLMs in code completion.
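
The In-context RAG step is simple enough to sketch: candidate snippets from the local project are ranked by the Jaccard similarity of their token sets, J(A, B) = |A ∩ B| / |A ∪ B|, and the best matches are prepended to the prompt. The Python sketch below only illustrates this idea; the regex tokenizer, function names, and prompt layout are assumptions for illustration, not the authors' implementation.

    import re

    # Minimal sketch of In-context RAG with token-set Jaccard similarity.
    # Tokenizer, names, and prompt assembly are illustrative assumptions.

    def tokenize(code: str) -> set[str]:
        # Crude tokenizer: identifier-like tokens only (an assumption).
        return set(re.findall(r"\w+", code))

    def jaccard(a: set[str], b: set[str]) -> float:
        # Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B|.
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    def retrieve(query: str, snippets: list[str], k: int = 2) -> list[str]:
        # Rank local-project snippets by similarity to the completion context.
        q = tokenize(query)
        return sorted(snippets, key=lambda s: jaccard(q, tokenize(s)), reverse=True)[:k]

    # The top-k snippets are prepended to the model's prompt before completion.
    context = "def load_config(path):"
    project = [
        "def load_config(path):\n    with open(path) as f:\n        return json.load(f)",
        "def save_model(model, path):\n    torch.save(model.state_dict(), path)",
    ]
    prompt = "\n\n".join(retrieve(context, project, k=1)) + "\n\n" + context

A retrieval this simple runs entirely on the local machine, which matches the paper's focus on privacy-preserving, locally executed completion.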
Source journal
Expert Systems with Applications (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months
About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.