Detecting similar repositories on GitHub

Yun Zhang, D. Lo, Pavneet Singh Kochhar, Xin Xia, Quanlai Li, Jianling Sun
Published in: 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 13-23
Publication date: 2017-02-01
DOI: 10.1109/SANER.2017.7884605 (https://doi.org/10.1109/SANER.2017.7884605)
Citations: 66

Abstract

GitHub contains millions of repositories, many of which are similar to one another (i.e., they have similar source code or implement similar functionality). Finding similar repositories on GitHub can help software engineers reuse source code, build prototypes, identify alternative implementations, explore related projects, find projects to contribute to, and discover code theft and plagiarism. Previous studies have proposed techniques to detect similar applications by analyzing API usage patterns and software tags. However, these prior studies either make use of only a limited source of information or use information not available for projects on GitHub. In this paper, we propose a novel approach that can effectively detect similar repositories on GitHub. Our approach is based on three heuristics that leverage two data sources (i.e., GitHub stars and readme files) not considered in previous work. The three heuristics are: repositories whose readme files contain similar content are likely to be similar to one another, repositories starred by users with similar interests are likely to be similar, and repositories starred together within a short period of time by the same user are likely to be similar. Based on these three heuristics, we compute three relevance scores (i.e., readme-based relevance, stargazer-based relevance, and time-based relevance) to assess the similarity between two repositories. By integrating the three relevance scores, we build a recommendation system called RepoPal to detect similar repositories. We compare RepoPal to a prior state-of-the-art approach, CLAN, using one thousand Java repositories on GitHub. Our empirical evaluation demonstrates that RepoPal achieves a higher success rate, precision, and confidence than CLAN.
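
The abstract names three relevance scores but does not give their formulas, so the following Python sketch is only an illustration of how the three heuristics could be turned into numbers. Everything in it is an assumption rather than RepoPal's actual method: TF-IDF cosine similarity stands in for readme-based relevance, Jaccard overlap of stargazer sets stands in for stargazer-based relevance, an inverse of the gap (in days) between one user's two stars stands in for time-based relevance, and a plain average stands in for the integration step. The function names, the sample repositories, and the sample users are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens of a readme (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(readmes):
    """Map each repo to a {term: weight} TF-IDF vector with smoothed IDF."""
    docs = {repo: Counter(tokens(text)) for repo, text in readmes.items()}
    n = len(docs)
    df = Counter(term for counts in docs.values() for term in counts)
    return {
        repo: {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, tf in counts.items()}
        for repo, counts in docs.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def readme_relevance(readmes, a, b):
    """Heuristic 1 (assumed form): similar readme content."""
    vectors = tfidf_vectors(readmes)
    return cosine(vectors[a], vectors[b])


def stargazers_of(stars, repo):
    """Users who starred the repo; stars maps user -> {repo: day_starred}."""
    return {user for user, starred in stars.items() if repo in starred}


def stargazer_relevance(stars, a, b):
    """Heuristic 2 (simplified assumption): overlap of stargazer sets."""
    sa, sb = stargazers_of(stars, a), stargazers_of(stars, b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def time_relevance(stars, a, b):
    """Heuristic 3 (assumed form): same user starred both close in time."""
    common = stargazers_of(stars, a) & stargazers_of(stars, b)
    if not common:
        return 0.0
    closeness = [1.0 / (1.0 + abs(stars[u][a] - stars[u][b])) for u in common]
    return sum(closeness) / len(common)


def repo_relevance(readmes, stars, a, b):
    """Integrate the three scores with a plain average (an assumption)."""
    return (readme_relevance(readmes, a, b)
            + stargazer_relevance(stars, a, b)
            + time_relevance(stars, a, b)) / 3.0


# Hypothetical sample data: readme text and the day each user starred a repo.
READMES = {
    "a/json-parser": "Fast JSON parser for Java",
    "b/json-lib": "JSON parser and serializer for Java",
    "c/game-engine": "2D game engine with physics",
}
STARS = {
    "u1": {"a/json-parser": 0, "b/json-lib": 1},
    "u2": {"a/json-parser": 10, "b/json-lib": 12},
    "u3": {"c/game-engine": 5, "a/json-parser": 300},
}

if __name__ == "__main__":
    for other in ("b/json-lib", "c/game-engine"):
        score = repo_relevance(READMES, STARS, "a/json-parser", other)
        print(other, round(score, 3))
```

On this toy data the two JSON repositories should rank as more similar than the JSON parser and the game engine, since they share readme terms, share stargazers, and were starred within days of each other. A real system would need the paper's own definitions and weighting in place of these stand-ins.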