UNA: Improving Automated PL-NL System by A Unified Neural Architecture

IF 5.7 | Zone 2, Computer Science | JCR Q1, Computer Science, Hardware & Architecture

IEEE Transactions on Reliability Pub Date : 2025-03-04 DOI:10.1109/TR.2025.3541087

Dawei Yuan;Tao Zhang;He Jiang

IEEE Transactions on Reliability, vol. 74, no. 3, pp. 3585-3599. Journal article, not open access. https://ieeexplore.ieee.org/document/10909995/

Citations: 0

Abstract

With the widespread application of artificial intelligence (AI) technologies, automated programming language-natural language (PL-NL) systems have gained significant attention and now support a range of tasks that serve developers and users, such as code search and code summarization. Mainstream PL-NL systems currently treat PL-NL as bimodal data and use two separate neural architectures (e.g., recurrent neural networks) to learn representations of PL and NL and to build their semantic relations. However, two issues limit the representation learning ability of these systems: first, large vocabularies cause data sparsity and limit the learning ability of neural architectures; second, there is not always a one-to-one correspondence between source code and natural language. To address these two issues, in this article we introduce the unified neural architecture (UNA), which builds a unified vocabulary (Uni-Vocab) at the subword level to provide high-quality PL-NL services. In Uni-Vocab, we build a unified modal encoding for PL-NL, which allows us to control the vocabulary size effectively and address the data sparsity problem. UNA then learns a unified contextual representation of PL-NL, which helps build their unified semantic relations. To validate the effectiveness of UNA, we perform experiments on code search and code summarization, two PL-NL tasks for developers and users. Experimental results show that UNA improves the baseline approaches in these two tasks by up to 36.09% in mean reciprocal rank (MRR) and 18.02% in bilingual evaluation understudy (BLEU), respectively.
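The abstract does not specify how Uni-Vocab is constructed, so the following is only a minimal sketch of the general idea of a shared subword vocabulary, assuming a byte-pair-encoding (BPE) style merge procedure. The helper names (`split_identifiers`, `learn_merges`, `encode`) and the toy corpus are hypothetical and are not taken from the paper. The point it illustrates is that code identifiers and natural-language words, once split into subwords from one vocabulary, can map to the same units, which keeps the vocabulary small.

```python
import re
from collections import Counter


def split_identifiers(text):
    """Split text into lowercase words, breaking camelCase and snake_case."""
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)  # readFile -> read File
    return re.findall(r"[a-z]+|[0-9]+", text.lower().replace("_", " "))


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)  # "</w>" marks word end
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def encode(word, merges):
    """Segment one word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical): code and natural language are pooled before learning,
# so a single vocabulary covers both modalities.
code = ["def readFile(path): return open(path).read()",
        "def write_file(path, data): open(path, 'w').write(data)"]
docs = ["read a file from a path", "write data to a file"]
words = [w for text in code + docs for w in split_identifiers(text)]
merges = learn_merges(words, num_merges=30)
uni_vocab = sorted({s for w in set(words) for s in encode(w, merges)})

# The identifier readFile and the phrase "read file" yield the same subword sequence.
code_side = [s for w in split_identifiers("readFile") for s in encode(w, merges)]
text_side = [s for w in split_identifiers("read file") for s in encode(w, merges)]
print(code_side == text_side, len(uni_vocab))
```

In this sketch the vocabulary size is bounded by the number of single characters plus `num_merges`, which is one way a subword vocabulary keeps size under control; the actual UNA encoding may differ.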


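Source journal

IEEE Transactions on Reliability (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 12.20

Self-citation rate: 8.50%

Articles published: 153

Review time: 7.5 months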

About the journal: IEEE Transactions on Reliability is a refereed journal for the reliability and allied disciplines including, but not limited to, maintainability, physics of failure, life testing, prognostics, design and manufacture for reliability, reliability for systems of systems, network availability, mission success, warranty, safety, and various measures of effectiveness. Topics eligible for publication range from hardware to software, from materials to systems, from consumer and industrial devices to manufacturing plants, from individual items to networks, from techniques for making things better to ways of predicting and measuring behavior in the field. As an engineering subject that supports new and existing technologies, we constantly expand into new areas of the assurance sciences.