OSS License Identification at Scale: A Comprehensive Dataset Using World of Code

Mahmoud Jahanshahi, David Reid, Adam McDaniel, Audris Mockus
arXiv - CS - Software Engineering, published 2024-09-07. DOI: arxiv-2409.04824 (https://doi.org/arxiv-2409.04824)

Abstract

The proliferation of open source software (OSS) has led to a complex landscape of licensing practices, making accurate license identification crucial for legal and compliance purposes. This study presents a comprehensive analysis of OSS licenses using the World of Code (WoC) infrastructure. We take an exhaustive approach, scanning all files containing "license" in their filepath, and apply the winnowing algorithm for robust text matching. Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map. We verify the accuracy of our approach through stratified sampling and manual review, achieving a final accuracy of 92.08%, with precision of 87.14%, recall of 95.45%, and an F1 score of 91.11%. This work enhances the understanding of OSS licensing practices and provides a valuable resource for developers, researchers, and legal professionals. Future work will expand the scope of license detection to include code files and references to licenses in project documentation.
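To make the matching step concrete, the following is a minimal sketch of winnowing-based text comparison, not the authors' implementation. The abstract does not state the k-gram length, window size, hash function, normalization rules, or whether the filepath check is case-sensitive, so the values and choices below (k = 5, window = 4, truncated SHA-1, case-insensitive path test) and all function names are assumptions made for illustration. The sketch also simplifies standard winnowing by keeping only the set of selected hash values and discarding their positions.

```python
import hashlib
import re


def is_license_path(path: str) -> bool:
    """Hypothetical filter: keep paths that mention 'license' (case-insensitive by assumption)."""
    return "license" in path.lower()


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-character substring of the normalized text."""
    s = normalize(text)
    return [
        int.from_bytes(hashlib.sha1(s[i:i + k].encode()).digest()[:4], "big")
        for i in range(len(s) - k + 1)
    ]


def winnow(text: str, k: int = 5, window: int = 4) -> set[int]:
    """Keep the minimum hash from each sliding window of k-gram hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the winnowed fingerprints of two texts."""
    fa, fb = winnow(a), winnow(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    reference = "Permission is hereby granted, free of charge, to any person obtaining a copy"
    candidate = "PERMISSION IS HEREBY GRANTED,\n  free of charge, to any person obtaining a copy."
    other = "Licensed under the Apache License, Version 2.0 (the License)"
    print(is_license_path("src/LICENSE.txt"))
    print(similarity(reference, candidate))
    print(similarity(reference, other))
```

Because the fingerprints are computed on normalized text, differences in case, whitespace, and punctuation do not affect the score, which is the property that makes this kind of matching robust to reformatted copies of the same license. A real pipeline would compare each candidate blob against a library of known license texts and apply a similarity threshold; the abstract does not specify one.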