Plagiarism Detection in Programming Assignments Using Deep Features

2017 4th IAPR Asian Conference on Pattern Recognition (ACPR). Pub Date: 2017-11-01. DOI: 10.1109/ACPR.2017.146

Jitendra Yasaswi, Suresh Purini, C. V. Jawahar

Citations: 13

Abstract

This paper proposes a method for detecting plagiarism in source code using deep features. The embeddings for programs are obtained using a character-level Recurrent Neural Network (char-rnn), which is pre-trained on Linux kernel source code. Many popular plagiarism detection tools are based on n-gram techniques at the syntactic level. However, these approaches fail to capture the long-term dependencies (non-contiguous interactions) present in source code. In contrast, the proposed deep features capture non-contiguous interactions within n-grams. These features are generic in nature, so there is no need to fine-tune the char-rnn model again on program submissions from each individual problem set. Our experiments show the effectiveness of deep features in the task of classifying assignment program submissions as copy, partial-copy, and non-copy. Comparing our proposed features with handcrafted features (source-code metrics and textual features), we report F1-score improvements of 9.5% for binary classification and 5% for three-way classification.
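The idea of turning a program into a fixed-length embedding with a character-level recurrent network can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: it uses a plain tanh RNN with random, untrained weights as a stand-in for the pre-trained char-rnn, a hypothetical hidden size of 16, the final hidden state as the embedding, and the element-wise absolute difference of two embeddings as the feature vector for a pair of submissions. The abstract does not specify any of these details.

```python
import math
import random

HIDDEN = 16  # hypothetical embedding size, chosen only for illustration
VOCAB = 128  # ASCII characters; anything else maps to index 0


def make_weights(seed=0):
    """Random weights standing in for a pre-trained model (assumption)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
    w_rec = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
    return w_in, w_rec


def embed(source, weights):
    """Run a vanilla RNN over the characters and return the final hidden state."""
    w_in, w_rec = weights
    h = [0.0] * HIDDEN
    for ch in source:
        idx = ord(ch) if ord(ch) < VOCAB else 0
        x = w_in[idx]  # a one-hot input times W_in reduces to a row lookup
        h = [
            math.tanh(x[j] + sum(h[i] * w_rec[i][j] for i in range(HIDDEN)))
            for j in range(HIDDEN)
        ]
    return h


def pair_features(a, b, weights):
    """Element-wise absolute difference of two program embeddings (assumption)."""
    ea, eb = embed(a, weights), embed(b, weights)
    return [abs(x - y) for x, y in zip(ea, eb)]


weights = make_weights()
original = "int sum(int a, int b) { return a + b; }"
renamed = "int add(int x, int y) { return x + y; }"
features = pair_features(original, renamed, weights)
print(len(features))
```

In a setup like the one the abstract describes, a vector of this kind would be passed to a classifier that labels the pair as copy, partial-copy, or non-copy. Because the weights are fixed once, the same embedding function can be reused across problem sets without retraining.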


