Mahmoud Jahanshahi, David Reid, Adam McDaniel, Audris Mockus
arXiv - CS - Software Engineering, 2024-09-07. https://doi.org/arxiv-2409.04824
OSS License Identification at Scale: A Comprehensive Dataset Using World of Code
The proliferation of open source software (OSS) has led to a complex
landscape of licensing practices, making accurate license identification
crucial for legal and compliance purposes. This study presents a comprehensive
analysis of OSS licenses using the World of Code (WoC) infrastructure. We
employ an exhaustive approach, scanning all files containing "license" in
their filepath, and apply the winnowing algorithm for robust text matching. Our
method identifies and matches over 5.5 million distinct license blobs across
millions of OSS projects, creating a detailed project-to-license (P2L) map. We
verify the accuracy of our approach through stratified sampling and manual
review, achieving a final accuracy of 92.08%, with precision of 87.14%, recall
of 95.45%, and an F1 score of 91.11%. This work enhances the understanding of
OSS licensing practices and provides a valuable resource for developers,
researchers, and legal professionals. Future work will expand the scope of
license detection to include code files and references to licenses in project
documentation.
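
The winnowing step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the k-gram length `k`, the window size `w`, the normalization rule, the use of truncated SHA-1 as the hash, and the Jaccard comparison are all assumptions chosen for illustration. The idea is to hash every k-character substring of a normalized text, keep the minimum hash from each sliding window of `w` consecutive hashes, and compare two texts by the overlap of the resulting fingerprint sets.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-character substring (truncated SHA-1, so results are deterministic)."""
    return [
        int.from_bytes(hashlib.sha1(text[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]


def winnow(text: str, k: int = 5, w: int = 4) -> set[int]:
    """Return the set of minimum hashes over all windows of w consecutive k-gram hashes.

    k and w are illustrative defaults, not values taken from the paper.
    """
    hashes = kgram_hashes(normalize(text), k)
    if not hashes:
        return set()
    if len(hashes) < w:
        # Text too short for a full window: keep its single smallest hash.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of two fingerprint sets (an assumed comparison metric)."""
    fa, fb = winnow(a), winnow(b)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    mit = "Permission is hereby granted, free of charge, to any person"
    mit_reflowed = "PERMISSION is hereby\n  granted, free of charge, to any person."
    apache = "Licensed under the Apache License, Version 2.0"
    print(similarity(mit, mit_reflowed))  # identical after normalization
    print(similarity(mit, apache))
```

Because fingerprints are computed on normalized text, differences in case, whitespace, and punctuation do not affect the match, which is one reason a fingerprinting approach of this kind is more robust than exact comparison of file contents.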