Encoder models at the European Patent Office: Pre-training and use cases

Impact Factor: 2.2 · Q2 · Information Science & Library Science
Volker D. Hähnke, Arnaud Wéry, Matthias Wirth, Alexander Klenner-Bajaja
{"title":"Encoder models at the European Patent Office: Pre-training and use cases","authors":"Volker D. Hähnke,&nbsp;Arnaud Wéry,&nbsp;Matthias Wirth,&nbsp;Alexander Klenner-Bajaja","doi":"10.1016/j.wpi.2025.102360","DOIUrl":null,"url":null,"abstract":"<div><div>Patents are organized using systems of technical concepts like the Cooperative Patent Classification. Classification information is extremely valuable for patent professionals, particularly for patent search. Language models have proven useful in Natural Language Processing tasks, including document classification. Generally, pre-training on a domain is essential for optimal downstream performance. Currently, there are no models pre-trained on patents with sequence length above 512. We pre-trained a RoBERTa model with sequence length 1024, increasing the fully covered claims sections from 12% to 53%. It has a ‘base’ configuration, reducing free parameters compared to ‘large’ models in the patent domain three-fold. We fine-tuned the model on classification tasks in the CPC, up to leaf level. Our tokenizer produces sequences on average 5% and up to 10% shorter than the general English RoBERTa tokenizer. With our pre-trained ‘base’ size model, we reach classification performance better than general English models, comparable to ‘large’ models pre-trained on patents. On the finest CPC granularity, 88% of test documents have at least one ground truth symbol in the top 10 predictions. Our CPC prediction models and data sets are publicly accessible. With the described procedures, we can periodically repeat pre-training and fine-tuning to cope with drift effects.</div></div>","PeriodicalId":51794,"journal":{"name":"World Patent Information","volume":"81 ","pages":"Article 102360"},"PeriodicalIF":2.2000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Patent Information","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0172219025000274","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"INFORMATION SCIENCE & LIBRARY SCIENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Patents are organized using systems of technical concepts like the Cooperative Patent Classification (CPC). Classification information is extremely valuable for patent professionals, particularly for patent search. Language models have proven useful in Natural Language Processing tasks, including document classification. Generally, pre-training on a domain is essential for optimal downstream performance. Currently, there are no models pre-trained on patents with a sequence length above 512. We pre-trained a RoBERTa model with sequence length 1024, increasing the share of fully covered claims sections from 12% to 53%. It has a ‘base’ configuration, reducing free parameters three-fold compared to ‘large’ models in the patent domain. We fine-tuned the model on classification tasks in the CPC, down to leaf level. Our tokenizer produces sequences on average 5% and up to 10% shorter than the general English RoBERTa tokenizer. With our pre-trained ‘base’ size model, we reach classification performance better than general English models and comparable to ‘large’ models pre-trained on patents. At the finest CPC granularity, 88% of test documents have at least one ground truth symbol in the top 10 predictions. Our CPC prediction models and data sets are publicly accessible. With the described procedures, we can periodically repeat pre-training and fine-tuning to cope with drift effects.
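Two of the abstract's quantitative claims lend themselves to simple illustration: the average tokenized sequence length relative to the general English RoBERTa tokenizer, and the share of test documents with at least one ground-truth CPC symbol among the top 10 predictions. The Python sketch below shows how both quantities could be computed; it is not the authors' code, and "patent-roberta-tokenizer" is a hypothetical placeholder for a tokenizer trained on patent text (only "roberta-base" is a real Hugging Face identifier).

# Illustrative sketch only, not the authors' implementation.
import numpy as np
from transformers import AutoTokenizer

def avg_token_length(tokenizer, texts):
    # Mean number of tokens produced per text, without truncation.
    lengths = [len(tokenizer(t, truncation=False)["input_ids"]) for t in texts]
    return float(np.mean(lengths))

def top_k_hit_rate(scores, ground_truth, k=10):
    # Fraction of documents with at least one ground-truth label index
    # among the k highest-scoring predicted labels.
    # scores: (n_docs, n_labels) array; ground_truth: list of sets of label indices.
    hits = 0
    for doc_scores, truth in zip(scores, ground_truth):
        top_k = set(np.argsort(doc_scores)[::-1][:k].tolist())
        if truth & top_k:
            hits += 1
    return hits / len(ground_truth)

if __name__ == "__main__":
    general_tok = AutoTokenizer.from_pretrained("roberta-base")
    texts = ["A method for manufacturing a semiconductor device comprising ..."]
    print("avg tokens (roberta-base):", avg_token_length(general_tok, texts))
    # patent_tok = AutoTokenizer.from_pretrained("patent-roberta-tokenizer")  # hypothetical name
    # print("avg tokens (patent tokenizer):", avg_token_length(patent_tok, texts))

    # Toy example of the top-k metric with 3 labels and k=2.
    scores = np.array([[0.9, 0.1, 0.4], [0.2, 0.8, 0.3]])
    truth = [{2}, {0}]
    print("top-2 hit rate:", top_k_hit_rate(scores, truth, k=2))

In the paper's setting, the label space would be the set of CPC symbols at the chosen granularity and k would be 10, so a hit rate of 0.88 corresponds to the reported 88% of test documents.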
Journal
World Patent Information
CiteScore: 3.50 · Self-citation rate: 18.50% · Articles per year: 40
Aims and scope: The aim of World Patent Information is to provide a worldwide forum for the exchange of information between people working professionally in the field of Industrial Property information and documentation and to promote the widest possible use of the associated literature. Regular features include: papers concerned with all aspects of Industrial Property information and documentation; new regulations pertinent to Industrial Property information and documentation; short reports on relevant meetings and conferences; bibliographies, together with book and literature reviews.