Scalable multi-label patent classification via iterative large language model-assisted active learning

Impact Factor: 1.9 · JCR Q2 · Information Science & Library Science

World Patent Information · Pub Date: 2025-08-06 · DOI: 10.1016/j.wpi.2025.102380

Songquan Xiong, Shikun Chen, Jianwei He, Yangguang Liu, Junjie Mao, Chao Liu

Volume 82, Article 102380 · https://www.sciencedirect.com/science/article/pii/S017221902500047X


Abstract

Patent classification faces increasingly complex challenges due to the exponential growth in volume and technical sophistication of global patent databases. A substantial proportion of patents inherently belong to multiple technological categories simultaneously, rendering classification particularly challenging for both manual and automated systems. Current approaches struggle with computational scalability, prohibitive annotation costs, and the accurate identification of overlapping technical concepts across interdisciplinary innovations. This study presents a novel iterative framework that combines the advanced text comprehension capabilities of Large Language Models (LLMs) with the sample-efficient principles of active learning (AL) for scalable multi-label patent classification. We evaluated our approach using drone-related technologies extracted from a comprehensive dataset of 100,000 patents, focusing on ten key technological component categories. Our LLM-assisted active learning methodology achieved Macro-F1 and Micro-F1 scores of 0.85 and 0.88, respectively, demonstrating a 15% improvement in Macro-F1 compared to established baseline methods. Our approach reduced the required manual annotation effort by approximately 60% while maintaining comparable classification performance. These empirical findings demonstrate the potential for transforming large-scale patent analysis workflows and improving the efficiency of intellectual property management systems.
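The abstract reports both Macro-F1 and Micro-F1, which can diverge in multi-label settings. The following is a minimal sketch of how the two averages are computed, not the authors' evaluation code: the three label indices, the four toy documents, and all function names are illustrative assumptions and do not come from the paper or its dataset.

```python
from typing import List, Set, Tuple

# (true positives, false positives, false negatives) for one label
Counts = Tuple[int, int, int]


def label_counts(y_true: List[Set[int]], y_pred: List[Set[int]],
                 n_labels: int) -> List[Counts]:
    """Count TP/FP/FN separately for each label across all documents."""
    counts = []
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if k in t and k in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if k not in t and k in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if k in t and k not in p)
        counts.append((tp, fp, fn))
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: List[Counts]) -> float:
    """Unweighted mean of per-label F1: every label counts equally."""
    return sum(f1(*c) for c in counts) / len(counts)


def micro_f1(counts: List[Counts]) -> float:
    """F1 over pooled counts: frequent labels dominate the score."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


# Hypothetical toy data: each set holds the label indices of one document.
y_true = [{0, 1}, {0}, {2}, {0, 2}]
y_pred = [{0, 1}, {0, 1}, {0}, {0, 2}]

counts = label_counts(y_true, y_pred, n_labels=3)
print(f"Macro-F1: {macro_f1(counts):.4f}")  # 0.7302
print(f"Micro-F1: {micro_f1(counts):.4f}")  # 0.7692
```

In this toy case Micro-F1 exceeds Macro-F1 because the most frequent label (index 0) is also the best predicted one, and pooling the counts gives it more weight than the two rarer labels.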


Source journal

World Patent Information (Information Science & Library Science)

CiteScore: 3.50

Self-citation rate: 18.50%

Journal description: The aim of World Patent Information is to provide a worldwide forum for the exchange of information between people working professionally in the field of Industrial Property information and documentation and to promote the widest possible use of the associated literature. Regular features include: papers concerned with all aspects of Industrial Property information and documentation; new regulations pertinent to Industrial Property information and documentation; short reports on relevant meetings and conferences; bibliographies, together with book and literature reviews.