Token-based Approach for Real-time Plagiarism Detection in Digital Designs

Han Wan, Kangxu Liu, Xiaopeng Gao
Published in: 2018 IEEE Frontiers in Education Conference (FIE), 2018-10-01. DOI: 10.1109/FIE.2018.8658531 (https://doi.org/10.1109/FIE.2018.8658531)
Citations: 3

Abstract

This Research to Practice Work in Progress paper presents a token-based approach to detecting plagiarism in university courses with hardware programming assignments. Detecting plagiarism manually is difficult and time-consuming. Over the last two decades, a variety of plagiarism detection tools have been developed. Their techniques fall mainly into four categories: textual matching, program dependence graph comparison, abstract syntax tree analysis, and low-level form code comparison. Although there has been a great deal of research on detecting code clones in software programming languages (e.g., Basic, C/C++, Java, Python), research focused on hardware description languages is still lacking. Building on the effectiveness of simhash, a locality-sensitive hash function commonly used to detect near-duplicates in web crawling, we propose an improved real-time plagiarism detection approach for Verilog HDL (hardware description language) programming assignments. The core detection steps are extracting weighted tokens from the source code as a high-dimensional feature and mapping that feature to an f-bit fingerprint with the simhash technique. To account for the syntax characteristics of Verilog HDL, a token extraction strategy was designed to maximize the valid information that a fixed-length hash value can represent. Experiments on real course data sets were conducted to evaluate the performance of the token-based approach in comparison with an existing plagiarism detection tool (Moss). The results show that the token-based approach is suitable for plagiarism detection in digital designs for both online and batch queries. Furthermore, the token-based approach enables incremental plagiarism detection for a single submission without excessive overhead. Finally, we discuss the limitations of the current approach and directions for future research.
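The core steps named in the abstract (weighted tokens, then an f-bit simhash fingerprint) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the frequency-based weights, the choice of f = 64, and the use of BLAKE2b as the per-token hash are all assumptions made for illustration, and the function names `tokenize`, `token_hash`, `simhash`, and `hamming_distance` are hypothetical.

```python
import hashlib
import re
from collections import Counter

F_BITS = 64  # fingerprint width f; 64 is an assumed value, not taken from the paper


def tokenize(source: str) -> Counter:
    """Hypothetical tokenizer: drop comments, then weight each token by its frequency."""
    # Remove // line comments and /* block */ comments.
    source = re.sub(r"//.*?$|/\*.*?\*/", "", source, flags=re.MULTILINE | re.DOTALL)
    # Identifiers/keywords, sized literals such as 4'b1010, plain numbers, single symbols.
    pattern = r"[A-Za-z_]\w*|\d+'[bdhoBDHO][0-9a-fA-F_xzXZ]+|\d+|[^\s\w]"
    return Counter(re.findall(pattern, source))


def token_hash(token: str) -> int:
    """Hash one token to an F_BITS-wide integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=F_BITS // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(weighted_tokens: Counter) -> int:
    """Standard simhash: per-bit weighted vote over all token hashes."""
    counts = [0] * F_BITS
    for token, weight in weighted_tokens.items():
        h = token_hash(token)
        for bit in range(F_BITS):
            counts[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "module add(input a, input b, output y); assign y = a ^ b; endmodule"
    copied = (
        "// my own work\n"
        "module add(input a, input b, output y);\n"
        "  assign y = a ^ b;\n"
        "endmodule\n"
    )
    d = hamming_distance(simhash(tokenize(original)), simhash(tokenize(copied)))
    print("Hamming distance:", d)
```

Because each submission is reduced to one fixed-length integer, a new submission can be compared against stored fingerprints by Hamming distance alone, which is consistent with the incremental, single-submission checking the abstract describes. The distance threshold that would flag a pair as suspicious is not given in the abstract and would need to be tuned on course data.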