Identification Author of Source Code by Machine Learning Methods

Q3 Mathematics
A. Kurtukova, A. Romanov
{"title":"Identification Author of Source Code by Machine Learning Methods","authors":"A. Kurtukova, A. Romanov","doi":"10.15622/SP.2019.18.3.741-765","DOIUrl":null,"url":null,"abstract":"The paper is devoted to the analysis of the problem of determining the source code author , which is of interest to researchers in the field of information security, computer forensics, assessment of the quality of the educational process, protection of intellectual property. \nThe paper presents a detailed analysis of modern solutions to the problem. The authors suggest two new identification techniques based on machine learning algorithms: support vector machine, fast correlation filter and informative features; the technique based on hybrid convolutional recurrent neural network. \nThe experimental database includes samples of source codes written in Java, C ++, Python, PHP, JavaScript, C, C # and Ruby. The data was obtained using a web service for hosting IT-projects – Github. The total number of source codes exceeds 150 thousand samples. The average length of each of them is 850 characters. The case size is 542 authors. \nThe experiments were conducted with source codes written in the most popular programming languages. Accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted with the number of authors from 2 to 50 for the most popular Java programming language. The graphs of the relationship between identification accuracy and case size are plotted. The analysis of result showed that the method based on hybrid neural network gives 97% accuracy, and it’s at the present time the best-known result. The technique based on the support vector machine made it possible to achieve 96% accuracy. The difference between the results of the hybrid neural network and the support vector machine was approximately 5%.","PeriodicalId":53447,"journal":{"name":"SPIIRAS Proceedings","volume":"21 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2019-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SPIIRAS Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15622/SP.2019.18.3.741-765","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Mathematics","Score":null,"Total":0}
引用次数: 7

Abstract

The paper is devoted to the analysis of the problem of determining the source code author , which is of interest to researchers in the field of information security, computer forensics, assessment of the quality of the educational process, protection of intellectual property. The paper presents a detailed analysis of modern solutions to the problem. The authors suggest two new identification techniques based on machine learning algorithms: support vector machine, fast correlation filter and informative features; the technique based on hybrid convolutional recurrent neural network. The experimental database includes samples of source codes written in Java, C ++, Python, PHP, JavaScript, C, C # and Ruby. The data was obtained using a web service for hosting IT-projects – Github. The total number of source codes exceeds 150 thousand samples. The average length of each of them is 850 characters. The case size is 542 authors. The experiments were conducted with source codes written in the most popular programming languages. Accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted with the number of authors from 2 to 50 for the most popular Java programming language. The graphs of the relationship between identification accuracy and case size are plotted. The analysis of result showed that the method based on hybrid neural network gives 97% accuracy, and it’s at the present time the best-known result. The technique based on the support vector machine made it possible to achieve 96% accuracy. The difference between the results of the hybrid neural network and the support vector machine was approximately 5%.
通过机器学习方法识别源代码的作者
本文致力于分析确定源代码作者的问题,对信息安全、计算机取证、教育过程质量评估、知识产权保护等领域的研究人员有一定的参考价值。本文详细分析了解决这一问题的现代方法。作者提出了两种新的基于机器学习算法的识别技术:支持向量机、快速相关滤波和信息特征;该技术基于混合卷积递归神经网络。实验数据库包括用Java、c++、Python、PHP、JavaScript、C、c#和Ruby编写的源代码样本。这些数据是通过托管it项目的web服务Github获得的。源代码总数超过15万个样本。平均长度为850个字符。案例大小为542位作者。实验是用最流行的编程语言编写的源代码进行的。采用10倍交叉验证对不同数量的作者所开发的技术的准确性进行评估。对最流行的Java编程语言进行了一系列额外的实验,作者从2人到50人不等。绘制了识别精度与箱体尺寸的关系图。结果分析表明,基于混合神经网络的方法准确率达到97%,是目前最知名的结果。基于支持向量机的技术使其准确率达到96%。混合神经网络的结果与支持向量机的结果相差约5%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
SPIIRAS Proceedings
SPIIRAS Proceedings Mathematics-Applied Mathematics
CiteScore
1.90
自引率
0.00%
发文量
0
审稿时长
14 weeks
期刊介绍: The SPIIRAS Proceedings journal publishes scientific, scientific-educational, scientific-popular papers relating to computer science, automation, applied mathematics, interdisciplinary research, as well as information technology, the theoretical foundations of computer science (such as mathematical and related to other scientific disciplines), information security and information protection, decision making and artificial intelligence, mathematical modeling, informatization.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信