Identification Author of Source Code by Machine Learning Methods

Q3 Mathematics

SPIIRAS Proceedings, Pub Date: 2019-06-04, DOI: 10.15622/SP.2019.18.3.741-765

A. Kurtukova, A. Romanov


Citations: 7

Abstract

The paper analyzes the problem of determining the author of source code, which is of interest to researchers in information security, computer forensics, assessment of the quality of the educational process, and protection of intellectual property.

The paper presents a detailed analysis of modern solutions to the problem. The authors propose two new identification techniques based on machine learning: one that combines a support vector machine with a fast correlation filter and informative features, and one based on a hybrid convolutional recurrent neural network.

The experimental database includes samples of source code written in Java, C++, Python, PHP, JavaScript, C, C# and Ruby. The data was obtained from GitHub, a web service for hosting IT projects. The database contains more than 150 thousand source code samples, with an average length of 850 characters each. The case size (the number of authors) is 542.

The experiments were conducted with source code written in the most popular programming languages. The accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted for Java, the most popular programming language, with the number of authors ranging from 2 to 50. Identification accuracy is plotted against case size. The analysis of the results showed that the method based on the hybrid neural network gives 97% accuracy, which is currently the best known result. The technique based on the support vector machine achieved 96% accuracy. The difference between the results of the hybrid neural network and the support vector machine was approximately 5%.
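The 10-fold cross-validation protocol mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the classifier below is a deliberately simple stand-in (character trigram profiles compared by cosine similarity) rather than the support vector machine or the hybrid neural network, and the toy corpus, the author labels, and every function name are hypothetical. Only the evaluation scheme (split the samples into ten folds, hold each fold out once, average the accuracy) follows the text.

```python
from collections import Counter, defaultdict
import math
import random


def char_ngrams(code, n=3):
    """Count overlapping character n-grams in one source-code sample."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_profiles(samples, labels):
    """Sum the n-gram counts of each author's training samples into one profile."""
    profiles = defaultdict(Counter)
    for code, author in zip(samples, labels):
        profiles[author].update(char_ngrams(code))
    return profiles


def predict(profiles, code):
    """Return the author whose profile is most similar to the sample."""
    grams = char_ngrams(code)
    return max(profiles, key=lambda author: cosine(grams, profiles[author]))


def stratified_folds(labels, k, seed=0):
    """Assign every sample index to one of k folds, spreading each author evenly."""
    by_author = defaultdict(list)
    for idx, author in enumerate(labels):
        by_author[author].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_author.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_val_accuracy(samples, labels, k=10):
    """Mean held-out accuracy over k folds (empty folds are skipped)."""
    scores = []
    for fold in stratified_folds(labels, k):
        if not fold:
            continue
        held_out = set(fold)
        train_x = [s for i, s in enumerate(samples) if i not in held_out]
        train_y = [a for i, a in enumerate(labels) if i not in held_out]
        profiles = train_profiles(train_x, train_y)
        hits = sum(predict(profiles, samples[i]) == labels[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)


def make_toy_corpus(per_author=10):
    """Hypothetical two-author corpus with visibly different coding styles."""
    samples, labels = [], []
    for i in range(per_author):
        samples.append(
            f"for (int idx_{i} = 0; idx_{i} < n; ++idx_{i}) {{ total += data[idx_{i}]; }}"
        )
        labels.append("author_a")
        samples.append(
            f"int i{i}=0;\nwhile(i{i}<n)\n{{\n\tsum=sum+arr[i{i}];\n\ti{i}++;\n}}"
        )
        labels.append("author_b")
    return samples, labels


if __name__ == "__main__":
    toy_samples, toy_labels = make_toy_corpus()
    print("10-fold accuracy:", cross_val_accuracy(toy_samples, toy_labels, k=10))
```

Stratifying the folds by author keeps every author represented in each training split, which matters when the number of samples per author is small. Accuracy on a toy corpus like this says nothing about the figures reported in the paper.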


Source journal

SPIIRAS Proceedings Mathematics-Applied Mathematics

CiteScore

1.90

Self-citation rate

0.00%

Articles published

Review time

14 weeks

Journal description: The SPIIRAS Proceedings journal publishes scientific, educational, and popular-science papers on computer science, automation, applied mathematics, and interdisciplinary research, as well as information technology, the theoretical foundations of computer science (both mathematical and those related to other scientific disciplines), information security and information protection, decision making and artificial intelligence, mathematical modeling, and informatization.