Automated Detection of Password Leakage from Public GitHub Repositories

2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). Pub Date: 2022-05-01. DOI: 10.1145/3510003.3510150

Runhan Feng, Ziyang Yan, Shiyan Peng, Yuanyuan Zhang

Citations: 9

Abstract

The prosperity of the GitHub community has raised new concerns about data security in public repositories. Practitioners who manage authentication secrets such as textual passwords and API keys in source code may accidentally leave them in public repositories, resulting in secret leakage. If such leakage can be detected automatically and in time, potential damage could be avoided. Because existing approaches focus on detecting secrets with distinctive formats (e.g., API keys, cryptographic keys in PEM format), textual passwords, which are ubiquitously used for authentication, fall through the cracks. Given that a textual password can be virtually any string, a naive detection scheme based on regular expressions performs poorly. This paper presents PassFinder, an automated approach to effectively detecting password leakage, on a large scale, from public repositories that involve various programming languages. PassFinder uses deep neural networks to capture the intrinsic characteristics of textual passwords and to understand the semantics of the code snippets that use textual passwords for authentication, i.e., the contextual information of the passwords in the source code. Using this new technique, we performed the first large-scale, longitudinal analysis of password leakage on GitHub. We inspected newly uploaded public code files on GitHub for 75 days and found that password leakage is pervasive, affecting over sixty thousand repositories. Our work contributes to a better understanding of password leakage on GitHub, and we believe our technique could promote the security of the open-source ecosystem.
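
To make the abstract's point about regular expressions concrete, the following is a minimal sketch of what a naive detector might look like. It is not PassFinder and is not taken from the paper: the pattern, the function name `find_candidates`, and the sample snippet are all hypothetical, chosen only to show the two failure modes such a scheme tends to have (placeholders flagged as passwords, and real passwords missed when no password-like identifier appears nearby).

```python
import re

# Hypothetical naive rule (an assumption, not the paper's method):
# an identifier containing "password", "passwd", or "pwd", followed by
# "=" or ":", followed by a quoted string literal.
NAIVE_PATTERN = re.compile(
    r"""(?i)\b\w*(?:password|passwd|pwd)\w*\s*[:=]\s*(['"])(?P<value>[^'"]+)\1"""
)


def find_candidates(source: str) -> list[str]:
    """Return every quoted value assigned to a password-like name."""
    return [m.group("value") for m in NAIVE_PATTERN.finditer(source)]


# Illustrative input only; none of these strings come from the paper.
SAMPLE = '''
db_password = "hunter2"
password = "<YOUR_PASSWORD>"
conn = connect("db.example.com", "admin", "s3cr3t!")
'''

candidates = find_candidates(SAMPLE)
for value in candidates:
    print("candidate:", value)
```

On this sample the rule reports the placeholder `<YOUR_PASSWORD>` alongside `hunter2`, and it misses `s3cr3t!` because that literal is passed positionally rather than assigned to a password-like name. Telling these cases apart requires looking at both the string itself and the code around it, which is the gap the abstract says PassFinder's neural models are meant to address.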
