Comparison of Graph Embeddings for Source Code with Text Models Based on CNN and CodeBERT Architectures

Trudy Instituta sistemnogo programmirovaniia RAN. Pub Date: 2023-01-01. DOI: 10.15514/ispras-2023-35(1)-15

Vitaly Romanov, Vladimir Ivanov


Citations: 0

Abstract

One way to reduce bugs in source code is to build intelligent tools that make the development process easier. Such tools often rely on vector representations of source code and on machine learning techniques borrowed from natural language processing. However, these approaches do not take into account the specifics of source code and its structure. This work studies methods for pretraining graph vector representations of source code, where the graph represents the structure of the program. The results show that graph embeddings classify variable types in Python programs with an accuracy comparable to that of CodeBERT embeddings. Moreover, using text and graph embeddings together in a hybrid model can improve type classification accuracy by more than 10%.
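The hybrid idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each variable already has a text embedding and a graph embedding, that the two are combined by simple concatenation, and that a linear layer scores the candidate types. The function names, dimensions, and random placeholder vectors are all hypothetical.

```python
import random

def hybrid_features(text_emb, graph_emb):
    """Concatenate a variable's text embedding with its graph embedding.

    Concatenation is an assumption made for illustration; the abstract
    does not say how the two embeddings are combined.
    """
    return list(text_emb) + list(graph_emb)

def predict_type(features, weights, bias):
    """Score each type class with a linear layer and return the best index."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical sizes: 8-dim text embedding, 6-dim graph embedding, 3 type classes.
random.seed(0)
TEXT_DIM, GRAPH_DIM, N_TYPES = 8, 6, 3

# Random placeholders stand in for pretrained embeddings of one variable.
text_emb = [random.gauss(0, 1) for _ in range(TEXT_DIM)]
graph_emb = [random.gauss(0, 1) for _ in range(GRAPH_DIM)]

# Untrained classifier parameters: one weight row per type class.
weights = [
    [random.gauss(0, 1) for _ in range(TEXT_DIM + GRAPH_DIM)]
    for _ in range(N_TYPES)
]
bias = [0.0] * N_TYPES

features = hybrid_features(text_emb, graph_emb)
predicted = predict_type(features, weights, bias)
print(len(features), predicted)
```

In a trained system the weights would be learned from labeled variable types; here they are random, so the predicted index carries no meaning beyond showing the data flow.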

