Token-based Approach for Real-time Plagiarism Detection in Digital Designs

2018 IEEE Frontiers in Education Conference (FIE) Pub Date : 2018-10-01 DOI:10.1109/FIE.2018.8658531

Han Wan, Kangxu Liu, Xiaopeng Gao


Citations: 3

Abstract

This Research to Practice Work in Progress paper presents a token-based approach to detecting plagiarism in university courses with hardware programming assignments. Detecting plagiarism manually is difficult and time-consuming. Over the last two decades, a variety of plagiarism detection tools have been developed. Their techniques fall mainly into the following categories: textual matching, program dependence graph comparison, abstract syntax tree analysis, and low-level form code comparison. Although there has been a great deal of research on detecting code clones in software programming languages (e.g., Basic, C/C++, Java, Python), research focused on hardware description languages is still lacking. Building on the effectiveness of the locality-sensitive hash function simhash, which is commonly used to detect near-duplicates in web crawling, we propose an improved real-time plagiarism detection approach for Verilog HDL (hardware description language) programming assignments. The core detection steps are extracting weighted tokens from the source code as a high-dimensional feature and mapping that feature to an f-bit fingerprint with the simhash technique. To account for the syntax characteristics of Verilog HDL, a token extraction strategy was designed to maximize the valid information that a fixed-length hash value can represent. Experiments on real course data sets were conducted to evaluate the performance of the token-based approach in comparison with an existing plagiarism detection tool (Moss). The results show that our token-based approach is suitable for plagiarism detection in digital designs for both online queries and batch queries. Furthermore, the token-based approach enables incremental plagiarism detection for a single submission without excessive overhead. Finally, we discuss the limitations of the current approach and future research directions.
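The core steps named in the abstract (weighted tokens as features, then an f-bit simhash fingerprint) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the use of raw token counts as weights, the MD5-based per-token hash, and the names `weighted_tokens`, `simhash`, and `hamming` are all assumptions made for illustration, since the abstract does not specify the paper's token extraction strategy.

```python
import hashlib
import re
from collections import Counter


def weighted_tokens(source):
    """Assumed tokenizer: strip comments, then count each lexical token.

    The token count serves as the weight here; the paper's actual
    Verilog-specific weighting is not given in the abstract.
    """
    text = re.sub(r"//[^\n]*|/\*.*?\*/", " ", source, flags=re.S)
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", text)
    return Counter(tokens)


def simhash(features, f=64):
    """Map a {token: weight} feature set to an f-bit fingerprint."""
    if not 0 < f <= 128:
        raise ValueError("this sketch supports 1 to 128 bits (MD5 digest size)")
    mask = (1 << f) - 1
    totals = [0] * f
    for token, weight in features.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(f):
            # Each token votes +weight where its hash bit is 1, else -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(f):
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


original = "module m(input a, output b); assign b = ~a; endmodule"
commented = "// my own work\n" + original + " /* end */"
distance = hamming(simhash(weighted_tokens(original)),
                   simhash(weighted_tokens(commented)))
print(distance)  # comments are stripped before hashing, so this prints 0
```

Under these assumptions, two submissions are flagged as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. Because each submission reduces to a single fixed-length integer, a new submission can be compared against stored fingerprints without re-parsing earlier ones, which is one way to read the abstract's point about incremental detection.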

