Comparison of Graph Embeddings for Source Code with Text Models Based on CNN and CodeBERT Architectures

Vitaly Romanov, Vladimir Ivanov
Journal: Trudy Instituta sistemnogo programmirovaniia RAN
Publication date: 2023-01-01
Publication type: Journal Article
DOI: 10.15514/ispras-2023-35(1)-15
URL: https://doi.org/10.15514/ispras-2023-35(1)-15
Citation count: 0

Abstract

One possible way to reduce bugs in source code is to create intelligent tools that make the development process easier. Such tools often use vector representations of the source code and machine learning techniques borrowed from natural language processing. However, these approaches do not take into account the specifics of source code and its structure. This work studies methods for pretraining graph vector representations for source code, where the graph represents the structure of the program. The results show that graph embeddings classify variable types in Python programs with an accuracy comparable to that of CodeBERT embeddings. Moreover, using text and graph embeddings together in a hybrid model can improve the accuracy of type classification by more than 10%.
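
The hybrid idea in the abstract, using text and graph embeddings together for variable type classification, can be illustrated with a minimal sketch. The abstract does not describe the architecture, so everything below is an assumption made for illustration: the two embeddings are simply concatenated and passed to a linear layer with softmax, and the function names, vector values, weights, and type labels are hypothetical toy inputs, not the authors' model or data.

```python
import math


def hybrid_features(text_emb, graph_emb):
    """Concatenate the text and graph embeddings of one variable (assumed fusion)."""
    return list(text_emb) + list(graph_emb)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_type(features, weights, biases, labels):
    """Score each candidate type with a linear layer, then apply softmax."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Hypothetical toy inputs: 2-d text embedding, 2-d graph embedding, two types.
text_emb = [0.2, -0.1]
graph_emb = [0.7, 0.3]
labels = ["int", "str"]
weights = [
    [1.0, 0.0, 1.0, 0.0],  # weights for "int"
    [0.0, 1.0, 0.0, 1.0],  # weights for "str"
]
biases = [0.0, 0.0]

features = hybrid_features(text_emb, graph_emb)
predicted, probs = classify_type(features, weights, biases, labels)
print(predicted, probs)
```

In a real system the text embedding would come from a text model such as CodeBERT or a CNN, the graph embedding from a model pretrained on the program graph, and the classifier weights would be learned rather than fixed by hand.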