Daniele Bifolco, Simone Romano, Sabato Nocera, Rita Francese, Giuseppe Scanniello, Massimiliano Di Penta
Information and Software Technology, volume 187, Article 107854. Published 2025-07-29. DOI: 10.1016/j.infsof.2025.107854. URL: https://www.sciencedirect.com/science/article/pii/S0950584925001934
An empirical study on the accuracy of GitHub’s dependency graph and the nature of its inaccuracy
Context:
GitHub’s dependency graph is a tool that eases Software Composition Analysis (SCA), and it is leveraged not only by other tools or by practitioners in their analyses but also by researchers when conducting studies on open-source projects. However, its potential inaccuracy may seriously harm its applicability and usefulness.
Objective:
This paper quantitatively and qualitatively analyzes the accuracy of GitHub’s dependency graphs for Java and Python projects, how such accuracy has changed over time, and what the likely pitfalls and limitations of the dependency graph are.
Method:
After creating statistically significant samples of Java and Python projects, we analyzed their dependency graphs in two directions, forward (by looking at dependencies) and backward (by looking at dependents), and inspected their manifest/lock files.
Results:
Results indicate that, in our sample, over 27% of dependencies are inaccurate, as are up to 10% of dependents. The errors have several causes, among them an oversimplified processing of manifest/lock files by the dependency graph generator.
Conclusion:
Our results provide (i) guidelines for researchers to understand the threats arising in studies based on the dependency graph and (ii) insights to practitioners and tool builders to enhance their SCA, given the current limitations of the dependency graph.
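As a rough illustration of the forward direction described under Method, the sketch below compares the dependencies a graph reports for a Python project with those declared in its manifest. It is a minimal sketch under stated assumptions, not the study's procedure: the function names `parse_requirements` and `compare`, the simplified `requirements.txt` parsing, and the inaccuracy ratio are all hypothetical, since the abstract does not define how inaccuracy was measured.

```python
import re


def parse_requirements(text: str) -> set[str]:
    """Extract normalized package names from requirements.txt-style text.

    Hypothetical helper: handles only plain `name<specifier>` lines.
    """
    names = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        # Skip blanks, pip options such as -r/-e, and URL requirements.
        if not line or line.startswith("-") or "://" in line:
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # PEP 503 style normalization: lowercase, and runs of
            # "-", "_", "." collapse to a single "-".
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def compare(reported: set[str], declared: set[str]) -> dict:
    """Compare graph-reported dependencies with manifest-declared ones.

    The "inaccuracy" ratio here is an assumed definition: mismatched
    names divided by all names seen on either side.
    """
    missing = declared - reported    # declared but absent from the graph
    spurious = reported - declared   # in the graph but not declared
    universe = declared | reported
    rate = (len(missing) + len(spurious)) / len(universe) if universe else 0.0
    return {"missing": missing, "spurious": spurious, "inaccuracy": rate}


if __name__ == "__main__":
    manifest = "requests>=2.0\nFlask_SQLAlchemy==3.0  # orm\n-r other.txt\n\nnumpy\n"
    reported_by_graph = {"requests", "numpy", "urllib3"}  # made-up example data
    print(compare(reported_by_graph, parse_requirements(manifest)))
```

A parser this simple skips included files (`-r other.txt`) and URL requirements entirely, which is the kind of oversimplified manifest processing the Results name as one cause of errors.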
About the journal:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.