Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity

2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). Pub Date: 2021-07-11. DOI: 10.26226/morressier.613b5418842293c031b5b614

Md Omar Faruk Rokon, Pei Yan, Risul Islam, M. Faloutsos


Citations: 11

Abstract

How can we identify similar repositories and clusters in a large online archive such as GitHub? Determining repository similarity is an essential building block in studying the dynamics and the evolution of such software ecosystems. The key challenge is to find a representation for the diverse repository features that (a) captures all aspects of the available information and (b) is readily usable by ML algorithms. We propose Repo2Vec, a comprehensive embedding approach that represents a repository as a distributed vector by combining features from three types of information sources. As our key novelty, we consider three types of information: (a) metadata, (b) the structure of the repository, and (c) the source code. We also introduce a series of embedding approaches to represent these information types and combine them into a single embedding. We evaluate our method with two real datasets from GitHub, for a combined 1013 repositories. First, we show that our method outperforms previous methods in terms of precision (93% vs. 78%), with nearly twice as many Strongly Similar repositories and 30% fewer False Positives. Second, we show how Repo2Vec provides a solid basis for (a) distinguishing between malware and benign repositories and (b) identifying a meaningful hierarchical clustering. For example, we achieve 98% precision and 96% recall in distinguishing malware and benign repositories. Overall, our work is a fundamental building block for enabling many repository analysis functions, such as repository categorization by target platform or intention, detecting code reuse and clones, and identifying lineage and evolution.
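The idea of merging three information sources into one vector and then comparing repositories can be illustrated with a minimal sketch. The abstract does not say how Repo2Vec combines its embeddings or which similarity measure it uses, so everything below is an assumption made for illustration: the per-source vectors are toy values, `combine_embeddings` simply concatenates L2-normalized vectors, and cosine similarity stands in for the comparison step. None of these names come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def combine_embeddings(meta_vec, struct_vec, code_vec):
    """Concatenate the three normalized per-source vectors.

    Assumption: plain concatenation is used here only as a stand-in;
    the paper's actual combination rule is not given in the abstract.
    """
    return l2_normalize(meta_vec) + l2_normalize(struct_vec) + l2_normalize(code_vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy (metadata, structure, source code) vectors for three hypothetical repositories.
repo_a = combine_embeddings([1.0, 0.0], [0.5, 0.5], [0.2, 0.8, 0.0])
repo_b = combine_embeddings([0.9, 0.1], [0.4, 0.6], [0.1, 0.9, 0.0])
repo_c = combine_embeddings([0.0, 1.0], [1.0, 0.0], [0.0, 0.0, 1.0])

sim_ab = cosine_similarity(repo_a, repo_b)
sim_ac = cosine_similarity(repo_a, repo_c)
print(f"sim(a, b) = {sim_ab:.3f}, sim(a, c) = {sim_ac:.3f}")
```

Normalizing each source before concatenation keeps one source from dominating the combined vector purely because of its scale. In this toy data, `repo_a` and `repo_b` agree on all three sources and score higher than `repo_a` and `repo_c`, which is the kind of ranking a similarity search over repository embeddings would rely on.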
