Python Source Code De-anonymization Using Nested Bigrams

2018 IEEE International Conference on Data Mining Workshops (ICDMW). Pub Date: 2018-11-01. DOI: 10.1109/ICDMW.2018.00011

Pegah Hozhabrierdi, D. Hitos, C. Mohan

Citations: 3

Abstract

An important issue in cybersecurity is the insertion or modification of code by individuals other than its original authors. This motivates research on authorship attribution of unknown source code. We address the deficiencies of previously used feature extraction methods and propose a novel approach: Nested Bigrams. These features are easy to extract and carry substantial information about the interconnections between the nodes of the abstract syntax tree. We also show that, for a large number of authors, a Strongly Regularized Feed-forward Neural Network outperforms the Random Forest Classifier used in many code stylometric studies. A new ranking system for reducing the number of features is also proposed, and experiments show that this approach can reduce the feature set to 98 nested bigrams while maintaining a classification accuracy above 90 percent.
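
The abstract does not spell out how a nested bigram is built, so the following is only a rough sketch of the general idea of reading pairwise features off an abstract syntax tree. It assumes, for illustration, that a bigram is a (parent node type, child node type) pair, and it uses Python's standard `ast` module. The function name `parent_child_bigrams` and the sample snippet are hypothetical and do not come from the paper.

```python
import ast
from collections import Counter


def parent_child_bigrams(source: str) -> Counter:
    """Count (parent node type, child node type) pairs in the AST of `source`.

    Assumption: this parent/child pairing is an illustrative stand-in,
    not the paper's exact definition of a nested bigram.
    """
    tree = ast.parse(source)
    counts = Counter()
    # ast.walk visits every node; iter_child_nodes yields its direct children.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


# Hypothetical sample program used only to exercise the function.
sample = "def add(a, b):\n    return a + b\n"
bigrams = parent_child_bigrams(sample)

for pair, count in sorted(bigrams.items()):
    print(pair, count)
```

Counts like these could be assembled into one feature vector per source file and passed to a classifier; how the paper ranks and prunes the features down to 98 is not described in the abstract and is not reproduced here.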
