Schemes for Labeling Semantic Code Clones using Machine Learning

Abdullah M. Sheneamer, H. Hazazi, Swarup Roy, J. Kalita
Published in: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 981-985, December 2017. DOI: 10.1109/ICMLA.2017.00-25 (https://doi.org/10.1109/ICMLA.2017.00-25)
Citations: 7

Abstract

Machine learning approaches to code clone detection perform poorly because training samples are scarce, and they have so far been limited to clones up to Type-III. Most publicly available code clone corpora are incomplete and lack labeled samples of semantic (Type-IV) clones. We present two schemes for labeling all types of clones, including Type-IV clones, and restrict our study to Java code. First, we use an unsupervised approach to label Type-IV clones and validate the labels with expert Java programmers. Next, we present a supervised scheme that labels (or classifies) unknown samples based on the labeled samples produced by the first scheme. We evaluate both schemes on six well-known Java code clone corpora and report the quality of the produced clones in terms of kappa agreement, mean error, and accuracy. The results show that both schemes produce high-quality code clones, which should facilitate the future use of machine learning in detecting Type-IV clones.
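To make the notion of a semantic (Type-IV) clone concrete, the following is a minimal illustrative sketch and is not taken from the paper or its corpora. It assumes the common definition of a Type-IV clone as two fragments that compute the same result with different syntax. The class name `TypeFourCloneExample` and both method names are hypothetical.

```java
// Illustrative only: a hypothetical pair of Java methods that would
// typically be regarded as a Type-IV (semantic) clone pair.
class TypeFourCloneExample {

    // Fragment A: factorial computed with an explicit loop.
    static long factorialLoop(int n) {
        long result = 1;
        for (int i = 2; i <= n; i++) {
            result *= i;
        }
        return result;
    }

    // Fragment B: factorial computed recursively. It shares almost no
    // syntax with Fragment A but returns the same value for n >= 0.
    static long factorialRecursive(int n) {
        return n <= 1 ? 1 : n * factorialRecursive(n - 1);
    }

    public static void main(String[] args) {
        // Print both results side by side for a few small inputs.
        for (int n = 0; n <= 10; n++) {
            System.out.println(n + ": " + factorialLoop(n)
                    + " vs " + factorialRecursive(n));
        }
    }
}
```

Because the two methods differ in control structure rather than only in identifiers or statement edits, a detector that compares tokens or syntax trees alone may miss such a pair, which is why labeled Type-IV samples are useful as training data.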