Copyright Law and the Lifecycle of Machine Learning Models

IIC - International Review of Intellectual Property and Competition Law Pub Date : 2024-02-01 DOI:10.2139/ssrn.4670563

Martin Kretschmer, Thomas Margoni, Pinar Oruc



Abstract

Machine learning, a subfield of artificial intelligence (AI), relies on large corpora of data as input for learning algorithms, resulting in trained models that can perform a variety of tasks. While data or information are not subject matter within copyright law, almost all materials used to construct corpora for machine learning are protected by copyright law: texts, images, videos, and so on. There are global policy moves to address the copyright implications of machine learning, in particular in the context of so-called “foundation models” that underpin generative AI. This paper takes a step back, exploring empirically three technological settings through detailed case studies. We set out the established industry methodology of a lifecycle of AI (collecting data, organising data, model training, model operation) to arrive at descriptions suitable for legal analysis. This will allow an assessment of the challenges for a harmonisation of rights, exceptions and disclosure under EU copyright law. The three case studies are: 1. Machine learning for scientific purposes, in the context of a study of regional short-term letting markets; 2. Natural Language Processing (NLP), in the context of large language models; 3. Computer vision, in the context of content moderation of images. We find that the nature and quality of data corpora at the input stage are central to the lifecycle of machine learning. Because of the uncertain legal status of data collection and processing, combined with the competitive advantage gained by firms not disclosing technological advances, the inputs of the models deployed are often unknown. Moreover, the “lawful access” requirement of the EU exception for text and data mining may turn the exception into a decision by rightholders to allow machine learning in the context of their decision to allow access.
We assess policy interventions at EU level, seeking to clarify the legal status of input data via copyright exceptions, opt-outs or the forced disclosure of copyright materials. We find that the likely result is a fully copyright-licensed environment of machine learning that may have problematic effects for the structure of industry, innovation and scientific research.

