Towards Improving Multiple Authorship Attribution of Source Code

2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS). Pub Date: 2022-12-01. DOI: 10.1109/QRS57517.2022.00059

Pengnan Hao, Zhuguo Li, Cui Liu, Yu Wen, Fanming Liu

Abstract

Source code authorship attribution helps resolve copyright infringement disputes and detect plagiarism. However, most software projects are developed collaboratively, so it is necessary to study multiple authorship attribution. Existing methods are not reliable in this setting, for two reasons: i) it is difficult to determine the boundaries between code written by different authors within a sample; ii) the code segments belonging to each author in a sample are usually small or incomplete. This paper proposes a method to address these challenges. We first split a code sample into individual lines, then use Siamese networks to merge lines with similar author styles into code segments. Finally, we use a path-based code representation and machine learning to identify the authors. Experimental results show that the method achieves an accuracy of 87.35% on a C/C++ dataset and 91.35% on a Java dataset, outperforming existing methods.
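
The line-splitting and segment-merging steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how lines are merged, so the sketch assumes a greedy pass that joins each line to the previous segment when a pairwise similarity score reaches a threshold. The names `style_features`, `toy_similarity`, and `group_lines` are hypothetical, and `toy_similarity` is a hand-written stand-in for the learned Siamese network, using only a few surface style markers.

```python
import re
from typing import Callable, FrozenSet, List


def style_features(line: str) -> FrozenSet[str]:
    """Extract a few toy style markers from one line of code.

    Assumption: these markers are a hypothetical stand-in for the
    embedding a trained Siamese network would produce.
    """
    feats = set()
    if line.startswith("\t"):
        feats.add("tab_indent")
    elif line.startswith(" "):
        feats.add("space_indent")
    if re.search(r"[a-z]+_[a-z]+", line):
        feats.add("snake_case")
    if re.search(r"[a-z]+[A-Z][a-z]+", line):
        feats.add("camel_case")
    if " = " in line:
        feats.add("spaced_assign")
    if re.search(r"\S=\S", line):
        feats.add("tight_assign")
    return frozenset(feats)


def toy_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two lines' style markers (0.0 to 1.0)."""
    fa, fb = style_features(a), style_features(b)
    if not fa and not fb:
        return 1.0  # no evidence of a style difference
    return len(fa & fb) / len(fa | fb)


def group_lines(
    lines: List[str],
    similarity: Callable[[str, str], float] = toy_similarity,
    threshold: float = 0.5,
) -> List[List[str]]:
    """Greedily merge consecutive lines into style-consistent segments.

    Assumption: a line joins the current segment when it is similar
    enough to that segment's last line; otherwise it starts a new one.
    """
    segments: List[List[str]] = []
    for line in lines:
        if segments and similarity(segments[-1][-1], line) >= threshold:
            segments[-1].append(line)
        else:
            segments.append([line])
    return segments
```

In a fuller pipeline, each resulting segment would then be converted to a path-based representation and passed to a classifier to predict its author; that stage is omitted here because the abstract gives no details about it.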
