CODEIPPROMPT: Intellectual Property Infringement Assessment of Code Language Models.

Proceedings of machine learning research Pub Date : 2023-07-01

Zhiyuan Yu, Yuhao Wu, Ning Zhang, Chenguang Wang, Yevgeniy Vorobeychik, Chaowei Xiao

Pages: 40373-40389. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12377501/pdf/


Abstract

Recent advances in large language models (LMs) have improved their ability to synthesize programming code, but they have also raised concerns about intellectual property (IP) rights violations. Despite its significance, this issue remains relatively underexplored. In this paper, we aim to bridge the gap by presenting CODEIPPROMPT, a platform for automatically evaluating the extent to which code language models may reproduce licensed programs. It comprises two key components: prompts constructed from a database of licensed code to elicit IP-violating code from LMs, and a measurement tool that quantifies the extent of IP violation by code LMs. We conducted an extensive evaluation of existing open-source code LMs and commercial products, and found IP violations to be prevalent across all of these models. We further identified the root cause as the substantial proportion of the training corpus that is subject to restrictive licenses, resulting from both intentional inclusion and inconsistent licensing practices in the real world. To address this issue, we also explored potential mitigation strategies, including fine-tuning and dynamic token filtering. Our study provides a testbed for evaluating IP violation issues in existing code generation platforms and stresses the need for better mitigation strategies.
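
The two components named in the abstract (prompts derived from licensed code, and a measurement of how closely a model's output matches that code) can be illustrated with a minimal sketch. This is not the CODEIPPROMPT implementation: the abstract does not specify the similarity metric or the prompt construction, so the token-level `difflib` ratio, the line-based prompt split, the 0.5 threshold, and every function name below are assumptions chosen only for illustration.

```python
import difflib
import re


def tokenize(code: str) -> list[str]:
    """Crude lexer (assumption): identifiers, integers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def split_prompt(licensed_source: str, prompt_lines: int) -> tuple[str, str]:
    """Split a licensed snippet into a prompt (first lines) and the
    held-out continuation that a model might reproduce."""
    lines = licensed_source.splitlines(keepends=True)
    return "".join(lines[:prompt_lines]), "".join(lines[prompt_lines:])


def similarity(generated: str, reference: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in for whatever
    measurement the actual tool uses."""
    matcher = difflib.SequenceMatcher(
        None, tokenize(generated), tokenize(reference), autojunk=False
    )
    return matcher.ratio()


def flag_possible_reproduction(
    generated: str, reference: str, threshold: float = 0.5
) -> bool:
    """Flag a completion whose similarity to the licensed continuation
    meets a hypothetical threshold."""
    return similarity(generated, reference) >= threshold


if __name__ == "__main__":
    licensed = "def add(a, b):\n    total = a + b\n    return total\n"
    prompt, reference = split_prompt(licensed, prompt_lines=1)
    # In a real evaluation, `completion` would come from the model under test.
    completion = reference
    print(repr(prompt))
    print(similarity(completion, reference))
    print(flag_possible_reproduction(completion, reference))
```

In this sketch a completion is compared only against the continuation of the snippet that produced the prompt. A fuller evaluation would presumably compare against a whole database of licensed code, which the sketch does not attempt.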
