Naohiro Kawamitsu, T. Ishio, Tetsuya Kanda, R. Kula, Coen De Roover, Katsuro Inoue
{"title":"Identifying Source Code Reuse across Repositories Using LCS-Based Source Code Similarity","authors":"Naohiro Kawamitsu, T. Ishio, Tetsuya Kanda, R. Kula, Coen De Roover, Katsuro Inoue","doi":"10.1109/SCAM.2014.17","DOIUrl":null,"url":null,"abstract":"Developers often reuse source files developed for another project. In order to update a reused file to a newer version released by the original project, developers have to track which revision of a file was reused and how its content was modified. However, such tracking is tedious for developers. Many projects keep older versions of files whose bugs are already fixed in the original project. In this paper, we propose a technique to automatically identify source code reuse relationships between two repositories. Using a similarity metric based on longest common subsequence, we identify pairs of similar revisions of files across the repositories. To evaluate our approach, we have analyzed eight project pairs of open source software projects and compared the result with the recorded information in the repositories. As a result, we have identified 1394 file revisions as instances of source code reuse. While 75.3% of the instances are recorded in the repositories, 20.1% of the instances are unrecorded but recovered by our approach.","PeriodicalId":407060,"journal":{"name":"2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation","volume":"63 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"31","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SCAM.2014.17","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 31
Abstract
Developers often reuse source files developed for another project. In order to update a reused file to a newer version released by the original project, developers have to track which revision of a file was reused and how its content was modified. However, such tracking is tedious for developers. Many projects keep older versions of files whose bugs are already fixed in the original project. In this paper, we propose a technique to automatically identify source code reuse relationships between two repositories. Using a similarity metric based on longest common subsequence, we identify pairs of similar revisions of files across the repositories. To evaluate our approach, we have analyzed eight project pairs of open source software projects and compared the result with the recorded information in the repositories. As a result, we have identified 1394 file revisions as instances of source code reuse. While 75.3% of the instances are recorded in the repositories, 20.1% of the instances are unrecorded but recovered by our approach.