Plagiarism Detection in Programming Assignments Using Deep Features

2017 4th IAPR Asian Conference on Pattern Recognition (ACPR). Pub Date: 2017-11-01. DOI: 10.1109/ACPR.2017.146

Jitendra Yasaswi, Suresh Purini, C. V. Jawahar


Citations: 13

Abstract

This paper proposes a method for detecting plagiarism in source code using deep features. Program embeddings are obtained from a character-level recurrent neural network (char-rnn) pre-trained on Linux kernel source code. Many popular plagiarism detection tools rely on n-gram techniques at the syntactic level. However, these approaches fail to capture the long-term dependencies (non-contiguous interactions) present in source code. In contrast, the proposed deep features capture non-contiguous interactions within n-grams. The features are generic, so there is no need to fine-tune the char-rnn model again on the program submissions from each individual problem set. Our experiments show the effectiveness of deep features in classifying assignment submissions as copy, partial-copy, or non-copy. Compared with handcrafted features (source-code metrics and textual features), the proposed features improve the F1-score by 9.5% for binary classification and by 5% for three-way classification.
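To make the n-gram baseline mentioned in the abstract concrete, here is a minimal sketch of a character n-gram similarity score between two programs. It is not the authors' code and not the algorithm of any particular detection tool: the function names `char_ngrams` and `jaccard_similarity`, the window size `n = 4`, and the sample C snippets are all assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(source: str, n: int = 4) -> Counter:
    """Count every contiguous run of n characters in the source text."""
    # Collapse whitespace so that reformatting alone does not change the n-grams.
    text = " ".join(source.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def jaccard_similarity(a: str, b: str, n: int = 4) -> float:
    """Multiset Jaccard similarity of the two programs' character n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    intersection = sum((grams_a & grams_b).values())  # per-gram minimum counts
    union = sum((grams_a | grams_b).values())         # per-gram maximum counts
    # Two inputs too short to contain any n-gram are treated as identical.
    return intersection / union if union else 1.0


# Hypothetical submissions, used only to exercise the functions.
original = "int total = 0; for (int i = 0; i < n; i++) total += a[i];"
renamed = "int s = 0; for (int k = 0; k < n; k++) s += a[k];"
unrelated = "printf(\"hello\\n\"); return 0;"

print(jaccard_similarity(original, original))   # identical programs
print(jaccard_similarity(original, renamed))    # same loop, renamed identifiers
print(jaccard_similarity(original, unrelated))  # different program
```

Each n-gram is a contiguous window, so a score like this only reflects characters that sit next to each other. Relationships between parts of a program that lie further apart than the window are invisible to it, which is the limitation the abstract attributes to n-gram approaches and the motivation for learned embeddings.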
