{"title":"基于llm的改进多标签分层专利分类","authors":"Bardia Rafieian, Pere-Pau Vázquez","doi":"10.1016/j.wpi.2025.102356","DOIUrl":null,"url":null,"abstract":"<div><div>Classifying multi-label documents has always been a challenging task, especially when the labels follow a hierarchical structure. This complexity increases the difficulty of accurately predicting multiple interrelated labels, often limiting the overall classification performance. To address these challenges and improve the accuracy of hierarchical multi-label classification systems, we introduce a novel pipeline leveraging large language models (LLMs). By incorporating quantization techniques and optimizing weight updates through smaller matrices, we improve computational efficiency and scalability. Our approach demonstrates an improvement in accuracy compared to current state-of-the-art models. In this work, we apply this pipeline to patent documents, focusing on the multi-label hierarchical text classification problem using a transformer-based architecture. The hierarchical structure of the CPC (Cooperative Patent Classification) labels is preserved through a graph-based taxonomy, which enables more effective processing of patent categories. 
Our model is trained and evaluated on the USPTO-70k dataset, and we achieve substantial improvements across various metrics, including precision, recall, F1-score, and AUC.</div></div>","PeriodicalId":51794,"journal":{"name":"World Patent Information","volume":"81 ","pages":"Article 102356"},"PeriodicalIF":2.2000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improved multi-label hierarchical patent classification using LLMs\",\"authors\":\"Bardia Rafieian, Pere-Pau Vázquez\",\"doi\":\"10.1016/j.wpi.2025.102356\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Classifying multi-label documents has always been a challenging task, especially when the labels follow a hierarchical structure. This complexity increases the difficulty of accurately predicting multiple interrelated labels, often limiting the overall classification performance. To address these challenges and improve the accuracy of hierarchical multi-label classification systems, we introduce a novel pipeline leveraging large language models (LLMs). By incorporating quantization techniques and optimizing weight updates through smaller matrices, we improve computational efficiency and scalability. Our approach demonstrates an improvement in accuracy compared to current state-of-the-art models. In this work, we apply this pipeline to patent documents, focusing on the multi-label hierarchical text classification problem using a transformer-based architecture. The hierarchical structure of the CPC (Cooperative Patent Classification) labels is preserved through a graph-based taxonomy, which enables more effective processing of patent categories. 
Our model is trained and evaluated on the USPTO-70k dataset, and we achieve substantial improvements across various metrics, including precision, recall, F1-score, and AUC.</div></div>\",\"PeriodicalId\":51794,\"journal\":{\"name\":\"World Patent Information\",\"volume\":\"81 \",\"pages\":\"Article 102356\"},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2025-05-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"World Patent Information\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0172219025000237\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"INFORMATION SCIENCE & LIBRARY SCIENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Patent Information","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0172219025000237","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"INFORMATION SCIENCE & LIBRARY SCIENCE","Score":null,"Total":0}
Improved multi-label hierarchical patent classification using LLMs
Classifying multi-label documents has always been a challenging task, especially when the labels follow a hierarchical structure. This complexity increases the difficulty of accurately predicting multiple interrelated labels, often limiting the overall classification performance. To address these challenges and improve the accuracy of hierarchical multi-label classification systems, we introduce a novel pipeline leveraging large language models (LLMs). By incorporating quantization techniques and optimizing weight updates through smaller matrices, we improve computational efficiency and scalability. Our approach demonstrates an improvement in accuracy compared to current state-of-the-art models. In this work, we apply this pipeline to patent documents, focusing on the multi-label hierarchical text classification problem using a transformer-based architecture. The hierarchical structure of the CPC (Cooperative Patent Classification) labels is preserved through a graph-based taxonomy, which enables more effective processing of patent categories. Our model is trained and evaluated on the USPTO-70k dataset, and we achieve substantial improvements across various metrics, including precision, recall, F1-score, and AUC.
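The abstract mentions "optimizing weight updates through smaller matrices", which reads like a low-rank adaptation (LoRA-style) scheme, though the paper's exact method is not specified here. A minimal sketch of that idea, under the assumption that the frozen weight matrix W is augmented with a trainable low-rank product A @ B (all names and dimensions below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 64, 64, 4               # rank << d keeps the update cheap

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(d_out, rank)) * 0.01   # trainable down-projection
B = np.zeros((rank, d_in))                  # trainable up-projection, init 0

def adapted_forward(x):
    """Forward pass with the low-rank update (W + A @ B) applied lazily."""
    return W @ x + A @ (B @ x)

x = rng.normal(size=d_in)
# With B initialised to zero, the adapter starts as a no-op update:
assert np.allclose(adapted_forward(x), W @ x)

# Trainable parameters of the update vs. a full fine-tune of W:
lora_params = d_out * rank + rank * d_in    # 512
full_params = d_out * d_in                  # 4096
```

At rank 4 the update trains 512 parameters instead of 4096, an 8x reduction that grows with matrix size; this is the kind of saving that, combined with quantization of the frozen weights, makes fine-tuning large models tractable.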
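The graph-based CPC taxonomy described above can be pictured as a parent map over label codes, where predicting a deep label implies all of its ancestors (section, class, subclass, group). A hedged sketch of that hierarchy-consistency step, using illustrative CPC codes rather than the paper's actual data or algorithm:

```python
# Illustrative CPC parent map: each label points to its parent in the taxonomy.
CPC_PARENT = {
    "G06N3/08": "G06N3",   # subgroup  -> main group
    "G06N3":    "G06N",    # main group -> subclass
    "G06N":     "G06",     # subclass  -> class
    "G06":      "G",       # class     -> section
}

def ancestors(label):
    """Walk up the taxonomy graph, collecting every ancestor label."""
    out = []
    while label in CPC_PARENT:
        label = CPC_PARENT[label]
        out.append(label)
    return out

def close_labels(predicted):
    """Hierarchy-consistent label set: each prediction plus its ancestors."""
    closed = set(predicted)
    for lab in predicted:
        closed.update(ancestors(lab))
    return closed

print(sorted(close_labels({"G06N3/08"})))
# -> ['G', 'G06', 'G06N', 'G06N3', 'G06N3/08']
```

Enforcing this closure is one common way hierarchical multi-label classifiers keep predictions consistent with the label taxonomy; whether the paper applies it at training time, inference time, or both is not stated in the abstract.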
Journal description:
The aim of World Patent Information is to provide a worldwide forum for the exchange of information between people working professionally in the field of Industrial Property information and documentation and to promote the widest possible use of the associated literature. Regular features include: papers concerned with all aspects of Industrial Property information and documentation; new regulations pertinent to Industrial Property information and documentation; short reports on relevant meetings and conferences; bibliographies, together with book and literature reviews.