Identification and classification of free, open source software licenses: A systematic literature review

IF 4.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software Pub Date : 2025-09-19 DOI:10.1016/j.jss.2025.112628

Sergio Montes-Leon, Gregorio Robles, Jesus M. Gonzalez-Barahona

Volume 231, Article 112628. Journal Article. https://www.sciencedirect.com/science/article/pii/S0164121225002973

Citations: 0

Abstract

Background:

Licenses are a fundamental element of free, open source software (FOSS), since they express the permissions granted to those receiving the software. Identifying licenses in source code, and analyzing them, is therefore crucial to understanding the legal implications of using and distributing FOSS. Many researchers have accordingly devoted attention to license identification and analysis.

Goal:

To learn how researchers have identified and classified licenses in FOSS, including which techniques and tools they have used. We were also interested in how these techniques and tools have evolved over time, and in the public datasets available in this area.

Method:

We conducted a Systematic Literature Review, which yielded 50 scientific publications that we analyzed.

Results:

We observed that most studies focus on the use or development of specific tools. However, there is a recurring concern about the need to improve these tools and the techniques they use. The studies, and therefore the tools and techniques they present, are usually empirically validated. With respect to techniques, we found that the use of machine learning is still relatively scarce, with most papers presenting studies based on pattern matching and similar techniques. Notably, reuse of tools is relatively high, and many of these tools remain available. However, benchmarking studies highlight some specific tools, which, perhaps for that reason, are becoming more common in publications. The availability of datasets oriented towards license identification is limited, but very large datasets have been published in recent years.

Conclusions:

Data scarcity and a reliance on existing tools pose significant challenges for this research area. The relatively low use of machine learning techniques and the scarcity of studies on the classification of license texts open interesting opportunities for research, which the recent availability of large datasets facilitates. Researchers can also benefit from readily available tools for tasks such as comparison and benchmarking.




Source journal

Journal of Systems and Software (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore

8.60

Self-citation rate

5.70%

Articles published

193

Review time

16 weeks

Journal description: The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.