Enhancing Project-Specific Code Completion by Inferring Internal API Information

IF 5.6 · Zone 1 (Computer Science) · JCR Q1: COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering, Pub Date: 2025-07-25, DOI: 10.1109/TSE.2025.3592823

Le Deng; Xiaoxia Ren; Chao Ni; Ming Liang; David Lo; Zhongxin Liu

Volume 51, Issue 9, pp. 2566-2582. Publication type: Journal Article. https://ieeexplore.ieee.org/document/11096713/

Citations: 0

Abstract

Project-specific code completion, which aims to complete code based on the context of the project, is an important and practical software engineering task. The state-of-the-art approaches employ the retrieval-augmented generation (RAG) paradigm and prompt large language models (LLMs) with information retrieved from the target project for project-specific code completion. In practice, developers always define and use custom functionalities, namely internal APIs, to facilitate the implementation of specific project requirements. Thus, it is essential to consider internal API information for accurate project-specific code completion. However, existing approaches either retrieve similar code snippets, which do not necessarily contain related internal API information, or retrieve internal API information based on import statements, which usually do not exist when the related internal APIs have not yet been used in the file. Therefore, these project-specific code completion approaches face challenges in effectiveness or practicability. To this end, this paper aims to enhance project-specific code completion by locating internal API information without relying on import statements. We first propose a method to infer internal API information. Our method first extends the representation of each internal API by constructing its usage examples and functional semantic information (i.e., a natural language description of the function’s purpose) and constructs a knowledge base. Based on the knowledge base, our method uses an initial completion solution generated by LLMs to infer the API information necessary for completion. Based on this method, we propose a code completion approach that enhances project-specific code completion by integrating similar code snippets and internal API information. Furthermore, we developed a benchmark named ProjBench, which consists of recent, large-scale real-world projects and is free of leaked import statements. We evaluated the effectiveness of our approach on ProjBench and an existing benchmark, CrossCodeEval. Experimental results show that our approach outperforms the best-performing approach by an average of +5.91 in code exact match and +6.26 in identifier exact match, corresponding to relative improvements of 22.72% and 18.31%, respectively. We also show our method complements existing ones by integrating it into various baselines, boosting code match by +7.77 (47.80%) and identifier match by +8.50 (35.55%) on average.
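
The pipeline described in the abstract (extend each internal API with a usage example and a natural-language description, store these in a knowledge base, then use an initial LLM draft to infer which APIs are needed) can be illustrated with a minimal sketch. This is not the authors' implementation: the `ApiEntry` type, the helper names, the toy knowledge base, the hard-coded draft string standing in for an LLM call, and the token-overlap (Jaccard) scoring are all assumptions chosen to keep the example self-contained. The abstract does not say which retriever the paper uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiEntry:
    """One internal API with the two extra representations the abstract mentions."""
    signature: str      # hypothetical internal API signature
    usage_example: str  # constructed call site
    description: str    # natural-language summary of the function's purpose


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; snake_case and camelCase identifiers are split."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text).replace("_", " ")
    return {t.lower() for t in re.findall(r"[A-Za-z][A-Za-z0-9]*", spaced)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap; a simple stand-in for a real retriever (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def infer_apis(draft: str, kb: list[ApiEntry], top_k: int = 2) -> list[ApiEntry]:
    """Rank entries by how well the draft matches their usage example or description."""
    draft_tokens = tokens(draft)

    def score(entry: ApiEntry) -> float:
        return max(
            jaccard(draft_tokens, tokens(entry.usage_example)),
            jaccard(draft_tokens, tokens(entry.description)),
        )

    ranked = sorted(kb, key=score, reverse=True)
    # Drop entries that share nothing with the draft.
    return [entry for entry in ranked[:top_k] if score(entry) > 0]


def build_prompt(unfinished_code: str, similar_snippets: list[str], apis: list[ApiEntry]) -> str:
    """Combine inferred API information and similar snippets ahead of the code to complete."""
    api_lines = [f"# API: {a.signature} - {a.description}" for a in apis]
    snippet_lines = [f"# Similar code: {s}" for s in similar_snippets]
    return "\n".join(api_lines + snippet_lines + [unfinished_code])


# Toy knowledge base (all entries are invented for illustration).
kb = [
    ApiEntry("load_settings(path)", "cfg = load_settings('app.yaml')",
             "Read project settings from a YAML file."),
    ApiEntry("send_alert(channel, text)", "send_alert('ops', 'disk full')",
             "Post an alert message to a chat channel."),
    ApiEntry("retry_call(fn, attempts)", "result = retry_call(fetch, attempts=3)",
             "Call a function again after failures."),
]

# Hypothetical initial draft from an LLM: it guesses a function name that does
# not exist in the project, but its wording still hints at the right API.
draft = "settings = read_settings_file('app.yaml')"

inferred = infer_apis(draft, kb)
prompt = build_prompt("settings = ", ["cfg = load_settings(default_path)"], inferred)
print([entry.signature for entry in inferred])
print(prompt)
```

The point of the sketch is that the draft, not an import statement, drives retrieval: even when the draft hallucinates a name, its tokens overlap with the usage example and description of the intended internal API, so that API's information can be placed in the final prompt alongside similar code snippets.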


Source journal

IEEE Transactions on Software Engineering (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore

9.70

Self-citation rate

10.80%

Articles published

724

Review time

6 months

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include: a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models. b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects. c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards. d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues. e) System issues: Hardware-software trade-offs. f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.