Dingkun Li, Minghao Piao, H. Shon, K. Ryu, Incheon Paik
2014 IEEE 6th International Conference on Awareness Science and Technology (iCAST), published 2014-12-11
DOI: 10.1109/ICAWST.2014.6981824 (https://doi.org/10.1109/ICAWST.2014.6981824)
Citations: 3
One pass preprocessing for token-based source code clone detection
Token-based source code clone detection provides a promising way to detect source code duplication and redundancy. Preprocessing plays an important role in the KDD (knowledge discovery in databases) process behind clone detection, because every later step depends on it; as the old saying goes, well begun is half done. However, processing the unstructured source code files of large software systems is challenging and time- or space-consuming. This paper introduces a novel way to clean, tokenize, and transform source code into a form appropriate for mining. A tool called OPP (One Pass Preprocessor) has been developed to preprocess source code files efficiently and flexibly. We experimented on three large open source projects written in different host languages (Wildfly1.02, Linux core-3.6, and VTK), and the results show that the tool is powerful and flexible enough to preprocess these source code files and produces high-quality output.
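
To make the clean, tokenize, and transform steps concrete, here is a minimal Python sketch of a single-scan preprocessor for C-like source text. It is not the OPP implementation, which the abstract does not describe in detail. The keyword set, the placeholder names `ID`, `NUM`, and `STR`, and the function name `preprocess` are all assumptions made for illustration.

```python
import re

# Hypothetical keyword set for a C-like language (an assumption);
# a real tool would load one keyword list per host language.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "void"}

# One combined pattern, so a single left-to-right scan both cleans
# (drops comments and whitespace) and tokenizes the input.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)   # line and block comments
  | (?P<string>"(?:\\.|[^"\\])*")     # double-quoted string literals
  | (?P<number>\d+(?:\.\d+)?)         # integer or decimal literals
  | (?P<ident>[A-Za-z_]\w*)           # identifiers and keywords
  | (?P<space>\s+)                    # whitespace
  | (?P<op>.)                         # any other single character
""", re.VERBOSE | re.DOTALL)


def preprocess(source):
    """Return a normalized token list for `source` in one scan."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup      # name of the alternative that matched
        text = match.group()
        if kind in ("comment", "space"):
            continue                # clean: discard noise
        if kind == "string":
            tokens.append("STR")    # transform: abstract literal values
        elif kind == "number":
            tokens.append("NUM")
        elif kind == "ident":
            # Keywords are kept; other names collapse to one placeholder
            # so that renamed copies yield the same token sequence.
            tokens.append(text if text in KEYWORDS else "ID")
        else:
            tokens.append(text)     # operators and punctuation as-is
    return tokens


if __name__ == "__main__":
    print(preprocess("int a = b + 1; // add one\n"))
    print(preprocess("int total = count + 42; /* renamed copy */"))
```

Both calls produce the same token sequence, which is the property a token-based clone detector relies on: fragments that differ only in names, literal values, comments, or layout become identical after preprocessing.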