Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities

IF 6.6 | CAS Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING
Wei Ma, Shangqing Liu, Mengjie Zhao, Xiaofei Xie, Wenhang Wang, Qiang Hu, Jie Zhang, Yang Liu
{"title":"揭开代码预训练模型的神秘面纱:调查语法和语义能力","authors":"Wei Ma, Shangqing Liu, Mengjie Zhao, Xiaofei Xie, Wenhang Wang, Qiang Hu, Jie Zhang, Yang Liu","doi":"10.1145/3664606","DOIUrl":null,"url":null,"abstract":"<p>Code models have made significant advancements in code intelligence by encoding knowledge about programming languages. While previous studies have explored the capabilities of these models in learning code syntax, there has been limited investigation on their ability to understand code semantics. Additionally, existing analyses assume the number of edges between nodes at the abstract syntax tree (AST) is related to syntax distance, and also often require transforming the high-dimensional space of deep learning models to a low-dimensional one, which may introduce inaccuracies. To study how code models represent code syntax and semantics, we conduct a comprehensive analysis of 7 code models, including four representative code pre-trained models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) and three large language models (StarCoder, CodeLlama and CodeT5+). We design four probing tasks to assess the models’ capacities in learning both code syntax and semantics. These probing tasks reconstruct code syntax and semantics structures (AST, CDG, DDG and CFG) in the representation space. These structures are core concepts for code understanding. We also investigate the syntax token role in each token representation and the long dependency between the code tokens. Additionally, we analyze the distribution of attention weights related to code semantic structures. Through extensive analysis, our findings highlight the strengths and limitations of different code models in learning code syntax and semantics. The results demonstrate that these models excel in learning code syntax, successfully capturing the syntax relationships between tokens and the syntax roles of individual tokens. However, their performance in encoding code semantics varies. CodeT5 and CodeBERT demonstrate proficiency in capturing control and data dependencies, while UnixCoder shows weaker performance in this aspect. We do not observe LLMs generally performing much better than pre-trained models. The shallow layers of LLMs perform better than their deep layers. The investigation of attention weights reveals that different attention heads play distinct roles in encoding code semantics. Our research findings emphasize the need for further enhancements in code models to better learn code semantics. This study contributes to the understanding of code models’ abilities in syntax and semantics analysis. Our findings provide guidance for future improvements in code models, facilitating their effective application in various code-related tasks.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"29 1","pages":""},"PeriodicalIF":6.6000,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities\",\"authors\":\"Wei Ma, Shangqing Liu, Mengjie Zhao, Xiaofei Xie, Wenhang Wang, Qiang Hu, Jie Zhang, Yang Liu\",\"doi\":\"10.1145/3664606\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Code models have made significant advancements in code intelligence by encoding knowledge about programming languages. 
While previous studies have explored the capabilities of these models in learning code syntax, there has been limited investigation on their ability to understand code semantics. Additionally, existing analyses assume the number of edges between nodes at the abstract syntax tree (AST) is related to syntax distance, and also often require transforming the high-dimensional space of deep learning models to a low-dimensional one, which may introduce inaccuracies. To study how code models represent code syntax and semantics, we conduct a comprehensive analysis of 7 code models, including four representative code pre-trained models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) and three large language models (StarCoder, CodeLlama and CodeT5+). We design four probing tasks to assess the models’ capacities in learning both code syntax and semantics. These probing tasks reconstruct code syntax and semantics structures (AST, CDG, DDG and CFG) in the representation space. These structures are core concepts for code understanding. We also investigate the syntax token role in each token representation and the long dependency between the code tokens. Additionally, we analyze the distribution of attention weights related to code semantic structures. Through extensive analysis, our findings highlight the strengths and limitations of different code models in learning code syntax and semantics. The results demonstrate that these models excel in learning code syntax, successfully capturing the syntax relationships between tokens and the syntax roles of individual tokens. However, their performance in encoding code semantics varies. CodeT5 and CodeBERT demonstrate proficiency in capturing control and data dependencies, while UnixCoder shows weaker performance in this aspect. We do not observe LLMs generally performing much better than pre-trained models. The shallow layers of LLMs perform better than their deep layers. The investigation of attention weights reveals that different attention heads play distinct roles in encoding code semantics. Our research findings emphasize the need for further enhancements in code models to better learn code semantics. This study contributes to the understanding of code models’ abilities in syntax and semantics analysis. 
Our findings provide guidance for future improvements in code models, facilitating their effective application in various code-related tasks.</p>\",\"PeriodicalId\":50933,\"journal\":{\"name\":\"ACM Transactions on Software Engineering and Methodology\",\"volume\":\"29 1\",\"pages\":\"\"},\"PeriodicalIF\":6.6000,\"publicationDate\":\"2024-05-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Software Engineering and Methodology\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3664606\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Software Engineering and Methodology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3664606","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

Code models have made significant advancements in code intelligence by encoding knowledge about programming languages. While previous studies have explored the capabilities of these models in learning code syntax, there has been limited investigation into their ability to understand code semantics. Additionally, existing analyses assume that the number of edges between nodes in the abstract syntax tree (AST) is related to syntax distance, and they often require transforming the high-dimensional representation space of deep learning models into a low-dimensional one, which may introduce inaccuracies. To study how code models represent code syntax and semantics, we conduct a comprehensive analysis of seven code models, including four representative code pre-trained models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) and three large language models (StarCoder, CodeLlama, and CodeT5+). We design four probing tasks to assess the models' capacities in learning both code syntax and semantics. These probing tasks reconstruct code syntax and semantic structures (AST, CDG, DDG, and CFG) in the representation space; these structures are core concepts for code understanding. We also investigate the syntactic role encoded in each token representation and the long-range dependencies between code tokens. Additionally, we analyze the distribution of attention weights related to code semantic structures. Through extensive analysis, our findings highlight the strengths and limitations of different code models in learning code syntax and semantics. The results demonstrate that these models excel in learning code syntax, successfully capturing the syntax relationships between tokens and the syntax roles of individual tokens. However, their performance in encoding code semantics varies. CodeT5 and CodeBERT demonstrate proficiency in capturing control and data dependencies, while UnixCoder shows weaker performance in this respect. We do not observe LLMs generally performing much better than pre-trained models, and the shallow layers of LLMs perform better than their deep layers. The investigation of attention weights reveals that different attention heads play distinct roles in encoding code semantics. Our findings emphasize the need for further enhancements in code models to better learn code semantics. This study contributes to the understanding of code models' abilities in syntax and semantic analysis, and our findings provide guidance for future improvements in code models, facilitating their effective application in various code-related tasks.
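
To make the probing idea concrete, below is a minimal, illustrative sketch of an edge-probing classifier: a linear probe trained on frozen token representations to predict whether two tokens are linked by a data-dependency edge. The model choice, the toy snippet, and the hand-labelled edges are assumptions for illustration only; the paper's actual tasks cover AST, CDG, DDG, and CFG structures over a corpus labelled by program analysis.

```python
# Minimal edge-probing sketch. Assumptions: the model, the toy snippet, and
# the hand-labelled edges below are illustrative placeholders, not the
# authors' exact setup.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL = "microsoft/codebert-base"   # one of the studied encoders
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

code = "x = a + b"
enc = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]          # (seq_len, hidden_dim)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])

def pos(piece: str) -> int:
    # index of the first sub-token whose surface form contains the identifier
    return next(i for i, t in enumerate(tokens) if piece in t)

ix, ia, ib = pos("x"), pos("a"), pos("b")

# Illustrative gold labels: a -> x and b -> x are data dependencies, (a, b) is not.
# A real study derives such labels from a data-dependency graph (DDG).
pairs = [(ia, ix), (ib, ix), (ia, ib)]
labels = [1, 1, 0]

# Pair feature: concatenation of the two frozen token vectors.
X = np.stack([torch.cat([hidden[i], hidden[j]]).numpy() for i, j in pairs])
y = np.array(labels)

probe = LogisticRegression(max_iter=1000).fit(X, y)     # the linear probe
print("toy probe accuracy:", probe.score(X, y))
```

Keeping the probe linear and the encoder frozen keeps the probe's own capacity low, so high edge-prediction accuracy is attributable mainly to what the representations already encode rather than to the probe itself.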
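The attention-weight analysis can be sketched in the same hedged way: for each layer and head, measure how much attention mass falls on token pairs connected by a semantic edge. Again, the model, the snippet, and the edge labels are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal attention-alignment sketch. Assumptions as above: the model, the
# snippet, and the two data-dependency edges are placeholders for a parsed corpus.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_attentions=True)
model.eval()

code = "x = a + b"
enc = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions      # per layer: (1, heads, seq, seq)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
idx = {p: next(i for i, t in enumerate(tokens) if p in t) for p in ("x", "a", "b")}

# Query/key pairs for the illustrative edges: the assignment target `x`
# attending to its data-dependency sources `a` and `b`.
queries = [idx["x"], idx["x"]]
keys = [idx["a"], idx["b"]]

for layer, att in enumerate(attentions):
    att = att[0]                               # (heads, seq, seq)
    mass = att[:, queries, keys].mean(dim=-1)  # mean attention on edge pairs, per head
    best = int(mass.argmax())
    print(f"layer {layer:2d}: head {best:2d} assigns {mass[best].item():.3f} "
          f"mean attention to the DDG pairs")
```

A head whose attention mass concentrates on dependency-linked pairs, relative to other pairs, is evidence that this head specializes in encoding that semantic relation.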

Source Journal
ACM Transactions on Software Engineering and Methodology (Engineering & Technology, Computer Science: Software Engineering)
CiteScore: 6.30
Self-citation rate: 4.50%
Annual articles: 164
Review time: >12 weeks
Journal description: Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.