Encoder models at the European Patent Office: Pre-training and use cases
Volker D. Hähnke, Arnaud Wéry, Matthias Wirth, Alexander Klenner-Bajaja
World Patent Information, Vol. 81, Article 102360 (2025). DOI: 10.1016/j.wpi.2025.102360
Abstract
Patents are organized using systems of technical concepts such as the Cooperative Patent Classification (CPC). Classification information is extremely valuable for patent professionals, particularly for patent search. Language models have proven useful in Natural Language Processing tasks, including document classification. Generally, pre-training on a domain is essential for optimal downstream performance. Currently, there are no models pre-trained on patents with a sequence length above 512. We pre-trained a RoBERTa model with a sequence length of 1024, increasing the share of claims sections that are fully covered from 12% to 53%. It has a ‘base’ configuration, with three-fold fewer free parameters than the ‘large’ models available in the patent domain. We fine-tuned the model on classification tasks in the CPC, down to leaf level. Our tokenizer produces sequences that are on average 5%, and up to 10%, shorter than those of the general English RoBERTa tokenizer. With our pre-trained ‘base’-size model, we reach classification performance better than that of general English models and comparable to that of ‘large’ models pre-trained on patents. At the finest CPC granularity, 88% of test documents have at least one ground-truth symbol among the top 10 predictions. Our CPC prediction models and data sets are publicly accessible. With the described procedures, pre-training and fine-tuning can be repeated periodically to cope with drift effects.
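For readers who want to see what the longer-context setup implies in practice, the sketch below shows one way to configure a ‘base’-size RoBERTa with a 1024-token context for masked-language-model pre-training, using the Hugging Face transformers API. This is a minimal illustration, not the authors' released code: the stand-in tokenizer and the two extra position slots (a RoBERTa implementation convention) are assumptions, and the paper trains its own patent-specific tokenizer.

```python
# Minimal sketch (assumptions noted; not the authors' released code):
# a 'base'-size RoBERTa configured for masked-language-model pre-training
# with a 1024-token context window.
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast

# Stand-in tokenizer; the paper trains a dedicated patent tokenizer instead.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
tokenizer.model_max_length = 1024

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=1024 + 2,  # RoBERTa reserves two extra position slots
    hidden_size=768,                   # 'base' configuration: 12 layers, 12 heads
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~125M for 'base'
```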
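The reported 5% average reduction in sequence length can be checked with a comparison along the following lines. The domain-tokenizer path is a placeholder, since the abstract does not name the published artifact, and the sample texts are illustrative.

```python
# Hedged sketch: compare average tokenized lengths of a domain tokenizer
# against the general English RoBERTa tokenizer on sample patent texts.
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("roberta-base")
domain = AutoTokenizer.from_pretrained("path/to/patent-tokenizer")  # placeholder

def mean_tokens(tokenizer, texts: list[str]) -> float:
    """Average number of tokens the tokenizer produces per text."""
    return sum(len(tokenizer(t)["input_ids"]) for t in texts) / len(texts)

texts = ["A device comprising a housing, a sensor, and a processing unit."]
reduction = 1 - mean_tokens(domain, texts) / mean_tokens(general, texts)
print(f"Average sequence-length reduction: {reduction:.1%}")
```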
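The 88% figure corresponds to an "at least one correct label in the top 10" criterion. Since patent documents typically carry multiple CPC symbols, this is a multi-label measure: a document counts as a hit if any of its ground-truth symbols appears among the ten highest-scoring predictions. A minimal implementation, assuming the classifier outputs one score per CPC symbol (array and function names are illustrative), could look like this:

```python
# Minimal sketch of the reported metric: the share of documents with at least
# one ground-truth CPC symbol among the model's top-k predictions.
import numpy as np

def recall_any_at_k(scores: np.ndarray, truth: list[set[int]], k: int = 10) -> float:
    """scores: (num_docs, num_labels) model outputs;
    truth: per-document sets of ground-truth label indices."""
    top_k = np.argsort(-scores, axis=1)[:, :k]  # k highest-scoring labels per doc
    hits = [bool(set(row) & gt) for row, gt in zip(top_k, truth)]
    return float(np.mean(hits))
```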
Journal overview
The aim of World Patent Information is to provide a worldwide forum for the exchange of information between people working professionally in the field of Industrial Property information and documentation and to promote the widest possible use of the associated literature. Regular features include: papers concerned with all aspects of Industrial Property information and documentation; new regulations pertinent to Industrial Property information and documentation; short reports on relevant meetings and conferences; bibliographies, together with book and literature reviews.