An adaptive model for cross-domain code search

IF 3.8 · Zone 2, Computer Science · JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

Information and Software Technology Pub Date : 2025-07-02 DOI:10.1016/j.infsof.2025.107827

Mengge Fang, Lie Wang, Haize Hu

Information and Software Technology, Volume 186, Article 107827. Full text: https://www.sciencedirect.com/science/article/pii/S0950584925001661

Citations: 0

Abstract


An adaptive model for cross-domain code search

Context:

Code search is an important research direction in computer science. As software systems grow in scale and complexity, developers frequently need to search for and understand existing code in their daily work. Code search research aims to improve the efficiency and accuracy of code search and covers natural language-based code search, code similarity comparison, and code recommendation systems, among other topics. Better code search technology lets developers locate and comprehend the code they need more quickly, which improves the efficiency and quality of software development.

Objective:

However, deep learning-based code search models depend on large datasets and need substantial time to learn their parameters, which can impose considerable economic costs. Such models are also limited in their adaptability and perform sub-optimally when applied to a new dataset (i.e., cross-domain code search).

Methods:

To address these issues, we propose an Adaptive Cross-Domain code search model based on Self-Attention (ACD-SA), which is the first attempt to introduce a self-attention model into cross-domain code search. First, the fastText word embedding tool is used to obtain the initial vector. Second, self-attention characterizes the internal structural information of the initial vector, yielding the feature vector and the model parameters. Next, a word matching matrix is constructed from the feature vectors to generate the initial grammatical information vector. Subsequently, a long short-term memory (LSTM) network is trained on the initial grammatical information vector to extract grammatical patterns. Finally, cross-domain code search is performed by combining domain-specific word matching matrices with the grammatical patterns.
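The first three steps of the pipeline above can be illustrated with a minimal sketch in plain Python. The abstract gives no formulas, so everything here is an assumption: `toy_embedding` is a hypothetical stand-in for fastText vectors, `self_attention` is ordinary scaled dot-product attention without learned projection matrices, and `matching_matrix` uses cosine similarity between query-token and code-token features. The LSTM step is not covered. This is not the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption: a trained model would apply learned query/key/value
    projections; they are omitted to keep the sketch short.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    output = []
    for q in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        weights = softmax(scores)
        # Weighted sum of all token vectors for this position.
        output.append([
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ])
    return output


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_matrix(query_vecs, code_vecs):
    """Rows index query tokens, columns index code tokens."""
    return [[cosine(q, c) for c in code_vecs] for q in query_vecs]


def toy_embedding(tokens, dim=8, seed=0):
    """Hypothetical stand-in for fastText: one fixed random vector per token."""
    rng = random.Random(seed)
    table = {}
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return table


# Made-up query and code tokens, for illustration only.
query = ["read", "file", "lines"]
code = ["open", "file", "read", "lines", "close"]
table = toy_embedding(query + code)

query_features = self_attention([table[t] for t in query])
code_features = self_attention([table[t] for t in code])
matrix = matching_matrix(query_features, code_features)  # 3 rows x 5 columns
```

Because the matrix is built from token-level similarities rather than from a fixed vocabulary, it can be recomputed for a new domain's tokens without retraining, which is one plausible reading of why the abstract pairs domain-specific matching matrices with patterns learned elsewhere.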

Results:

To verify the effectiveness of ACD-SA in cross-domain code search, a comparative experiment is conducted on a training dataset and a target dataset. Compared with existing baseline models such as CodeHow, DeepCS, BAVE, and AdaCS, ACD-SA yields superior results for Hit@2, Hit@3, Hit@5, Hit@10, and MRR.
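For readers unfamiliar with the metrics named above, the sketch below computes Hit@k and MRR under their usual definitions: Hit@k is the fraction of queries whose first relevant result appears within the top k, and MRR is the mean of the reciprocal of that rank. The abstract does not describe the evaluation protocol, so the function names and the `ranks` list are illustrative assumptions, not the paper's data.

```python
def hit_at_k(ranks, k):
    """Fraction of queries whose first relevant result is within the top k.

    `ranks` holds the 1-based rank of the first relevant snippet per query,
    or None when no relevant snippet was returned.
    """
    if not ranks:
        return 0.0
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries; a missing result contributes 0."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Made-up ranks for five queries; None means nothing relevant was found.
ranks = [1, 3, None, 2, 7]

for k in (2, 3, 5, 10):
    print(f"Hit@{k} = {hit_at_k(ranks, k):.2f}")
print(f"MRR = {mean_reciprocal_rank(ranks):.4f}")
```

Hit@k rewards any relevant result in the top k equally, while MRR also rewards placing it higher, which is why results are usually reported for several k values alongside MRR.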

Conclusion:

By analyzing the shortcomings of existing methods in cross-domain code search, the article proposes the ACD-SA cross-domain code search model. ACD-SA only needs to be trained on large datasets, after which the model is applied to code search on domain-specific datasets. On the one hand, ACD-SA removes the need for traditional code search to spend a lot of time collecting or crawling large datasets and training model parameters for each search task. On the other hand, ACD-SA compensates for the limited dataset adaptability of existing code search models and realizes cross-domain code search.


Source journal

Information and Software Technology · Engineering and Technology, Computer Science: Software Engineering

CiteScore

9.10

Self-citation rate

7.70%

Articles published

164

Review time

9.6 weeks

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.