Sergio Montes-Leon, Gregorio Robles, Jesus M. Gonzalez-Barahona
Identification and classification of free, open source software licenses: A systematic literature review
Journal of Systems and Software, Volume 231, Article 112628. Published 2025-09-19. DOI: 10.1016/j.jss.2025.112628
Abstract
Background:
Licenses are a fundamental element of free, open source software (FOSS), since they express the permissions granted to those receiving the software. Identifying licenses in source code, and analyzing them, is therefore crucial to understanding the legal implications of using and distributing FOSS. Many researchers have accordingly devoted attention to license identification and analysis.
Goal:
To learn how researchers have identified and classified licenses in FOSS, including which techniques and tools they have used. We were also interested in the evolution of these techniques and tools over time, and the public datasets available in this realm.
Method:
We conducted a systematic literature review, which yielded 50 scientific publications that we analyzed.
Results:
We observed that most studies focus on the use or development of specific tools. However, there is a recurring concern about the need to improve these tools and the techniques they use. The studies presented, and therefore the tools and techniques they describe, are usually empirically validated. With respect to techniques, we found that the use of machine learning is still relatively scarce, with most papers presenting studies based on pattern matching and similar techniques. Reuse of tools is relatively high, and many of these tools remain available. However, benchmarking studies highlight some specific tools, which, perhaps for that reason, are becoming more common in publications. The availability of datasets oriented towards license identification is limited, but very large datasets have been published in recent years.
Conclusions:
Data scarcity and a reliance on existing tools pose significant challenges for this research area. The relatively low use of machine learning techniques and the scarcity of studies on the classification of license texts open interesting opportunities for research, which are facilitated by the recent availability of large datasets. Researchers can also benefit from readily available tools for tasks such as comparison and benchmarking.
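Illustration:
To make concrete what "pattern matching" means for license identification, the following is a minimal sketch and not a reproduction of any tool covered by the review. It assumes two common signals: an SPDX short-form tag in a file header, and a characteristic phrase from a license notice. The function name `identify_license` and the small `PHRASES` table are hypothetical and chosen only for illustration; a real tool would need a far larger rule set.

```python
import re

# SPDX short-form tag, e.g. "SPDX-License-Identifier: MIT".
# Assumption: only a single identifier is handled, not compound
# expressions such as "MIT OR Apache-2.0".
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")

# Leading comment markers (#, //, /*, *, */) at the start of a line.
COMMENT_PREFIX = re.compile(r"^[ \t]*(?:/\*+|\*+/?|//+|#+)[ \t]?", re.M)

# Hypothetical, deliberately tiny phrase table. The phrases are only
# approximate signals: other licenses may share the same wording.
PHRASES = [
    ("Apache-2.0",
     re.compile(r"Licensed under the Apache License, Version 2\.0", re.I)),
    ("MIT",
     re.compile(r"Permission is hereby granted, free of charge, "
                r"to any person obtaining a copy", re.I)),
]


def identify_license(text):
    """Return a license identifier found in `text`, or None."""
    # 1. An explicit SPDX tag is the most direct signal.
    tag = SPDX_TAG.search(text)
    if tag:
        return tag.group(1)

    # 2. Otherwise, strip comment markers and collapse whitespace so
    #    that a notice wrapped across comment lines still matches.
    flat = COMMENT_PREFIX.sub("", text)
    flat = re.sub(r"\s+", " ", flat)
    for license_id, pattern in PHRASES:
        if pattern.search(flat):
            return license_id
    return None
```

For example, `identify_license("// SPDX-License-Identifier: Apache-2.0")` takes the first branch, while a header that only quotes the notice text falls through to the phrase table. A sketch like this cannot recognize a license outside its table or a notice that has been reworded.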
About the journal:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.