Identifying Source Code Reuse across Repositories Using LCS-Based Source Code Similarity

Naohiro Kawamitsu, T. Ishio, Tetsuya Kanda, R. Kula, Coen De Roover, Katsuro Inoue
Published in: 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation (2014-09-28)
DOI: 10.1109/SCAM.2014.17 (https://doi.org/10.1109/SCAM.2014.17)
Citations: 31

Abstract

Developers often reuse source files developed for another project. In order to update a reused file to a newer version released by the original project, developers have to track which revision of a file was reused and how its content was modified. However, such tracking is tedious for developers. Many projects keep older versions of files whose bugs are already fixed in the original project. In this paper, we propose a technique to automatically identify source code reuse relationships between two repositories. Using a similarity metric based on longest common subsequence, we identify pairs of similar revisions of files across the repositories. To evaluate our approach, we have analyzed eight project pairs of open source software projects and compared the result with the recorded information in the repositories. As a result, we have identified 1394 file revisions as instances of source code reuse. While 75.3% of the instances are recorded in the repositories, 20.1% of the instances are unrecorded but recovered by our approach.
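
The abstract names a similarity metric based on longest common subsequence but does not give its exact definition. The following is a minimal sketch of how such a metric could look, not the authors' implementation. It assumes a line-level LCS, a normalization of 2 * LCS / (total lines in both files), and a threshold of 0.8; the function names `lcs_length`, `similarity`, and `most_similar_pair` and the threshold value are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two sequences."""
    # Rolling one-row dynamic programming table: O(len(a) * len(b)) time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(text_a: str, text_b: str) -> float:
    """Assumed normalization: 2 * LCS / (lines in a + lines in b)."""
    a, b = text_a.splitlines(), text_b.splitlines()
    if not a and not b:
        return 1.0  # two empty files are treated as identical
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def most_similar_pair(
    revs_a: List[str], revs_b: List[str], threshold: float = 0.8
) -> Optional[Tuple[int, int, float]]:
    """Best (index_a, index_b, score) at or above the threshold, else None."""
    best = None
    for i, ra in enumerate(revs_a):
        for j, rb in enumerate(revs_b):
            score = similarity(ra, rb)
            if score >= threshold and (best is None or score > best[2]):
                best = (i, j, score)
    return best
```

Under these assumptions, `revs_a` and `revs_b` would hold the contents of each revision of one file in the two repositories, and the returned index pair points at the revision that was most likely reused. Comparing every revision pair is quadratic in the number of revisions, so a real tool would probably need some pre-filtering.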