An adaptive model for cross-domain code search
Authors: Mengge Fang, Lie Wang, Haize Hu
Journal: Information and Software Technology, Volume 186, Article 107827
Publication date: 2025-07-02
Publication type: Journal Article
DOI: 10.1016/j.infsof.2025.107827
URL: https://www.sciencedirect.com/science/article/pii/S0950584925001661
Impact factor: 3.8; JCR: Q2 (Computer Science, Information Systems)
Context:
Code search is an important research direction in computer science. As software grows in scale and complexity, developers frequently need to find and understand existing code in their daily work. Code search research aims to improve the efficiency and accuracy of code retrieval, and covers natural language-based code search, code similarity comparison, and code recommendation systems, among other topics. Better code search techniques let developers locate and comprehend the code they need more quickly, which improves the efficiency and quality of software development.
Objective:
However, deep learning-based code search models rely on large datasets and need considerable time to learn their parameters, which can impose substantial economic costs. Furthermore, such models adapt poorly and perform sub-optimally when applied to a new dataset (i.e., cross-domain code search).
Methods:
To address these issues, we propose an Adaptive Cross-Domain code search model based on Self-Attention (ACD-SA), which is the first attempt to introduce a self-attention model into cross-domain code search. First, the fastText word embedding tool is used to obtain the initial vector. Second, self-attention is applied to capture the internal structure information of the initial vector, yielding the feature vector and model parameters. Next, a word matching matrix is constructed from the feature vectors to generate the initial grammatical information vector. Subsequently, a long short-term memory (LSTM) network is trained on the initial grammatical information vector to extract grammatical patterns. Finally, cross-domain code search is performed by combining domain-specific word matching matrices with the grammatical patterns.
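As a rough illustration of the second and third steps, the sketch below applies scaled dot-product self-attention to token embeddings and then builds a query-by-code word matching matrix. It is not the authors' implementation: the identity query/key/value projections, the use of cosine similarity for the matrix entries, the function names, and the toy 2-dimensional vectors standing in for fastText embeddings are all assumptions made for the example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention.

    Assumption: no learned projections, so queries, keys, and values
    are the input vectors themselves.
    """
    d = len(vectors[0])
    out = []
    for q in vectors:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in vectors])
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def matching_matrix(query_vecs, code_vecs):
    """Entry [i][j] scores query token i against code token j (assumed cosine)."""
    return [[cosine(q, c) for c in code_vecs] for q in query_vecs]


# Made-up embeddings: 2 query tokens and 3 code tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
code = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
M = matching_matrix(self_attention(query), self_attention(code))
```

Here `M` has one row per query token and one column per code token. In the pipeline described above, a matrix of this kind would feed the LSTM stage, which is omitted from the sketch.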
Results:
To verify the effectiveness of ACD-SA for cross-domain code search, a comparative experiment is conducted on a training dataset and a target dataset. Compared with existing baseline models such as CodeHow, DeepCS, BAVE, and AdaCS, ACD-SA achieves better results on Hit@2, Hit@3, Hit@5, Hit@10, and MRR.
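For readers unfamiliar with these metrics, the following is a minimal sketch of how Hit@k and MRR are commonly computed from the rank of the first correct result per query. The function names and the example ranks are hypothetical and are not taken from the paper's experiments. A rank of `None` is assumed to mean that no correct result was returned for that query.

```python
def hit_at_k(ranks, k):
    """Fraction of queries whose first correct result appears in the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; a missing result contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical 1-based ranks of the first correct snippet for five queries.
ranks = [1, 3, None, 2, 7]

for k in (2, 3, 5, 10):
    print(f"Hit@{k} = {hit_at_k(ranks, k):.2f}")
print(f"MRR = {mean_reciprocal_rank(ranks):.4f}")
```

Both metrics lie in [0, 1] and higher is better. Hit@k ignores where within the top k the correct result falls, whereas MRR rewards ranking it closer to the top.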
Conclusion:
After analyzing the shortcomings of existing methods for cross-domain code search, the article proposes ACD-SA, a cross-domain code search model. ACD-SA only needs to be trained on a large dataset, after which the trained model is applied to code search on domain-specific datasets. On the one hand, ACD-SA removes the need, in traditional code search, to spend a lot of time collecting or crawling large datasets and training model parameters for each search task. On the other hand, ACD-SA compensates for the limited dataset adaptability of existing code search models and realizes cross-domain code search.
About the journal:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.