Creating, anonymizing and evaluating the first medical language model pre-trained on Dutch Electronic Health Records: MedRoBERTa.nl

Impact factor 6.2 · Zone 2 (Medicine) · JCR Q1: Computer Science, Artificial Intelligence
Stella Verkijk, Piek Vossen
Artificial Intelligence in Medicine, volume 167, Article 103148. Published 2025-05-23. Journal Article.
DOI: 10.1016/j.artmed.2025.103148
https://www.sciencedirect.com/science/article/pii/S0933365725000831
Citations: 0

Abstract

Electronic Health Records (EHRs) contain written notes by all kinds of medical professionals about all aspects of a patient's well-being. When adequately processed with a Large Language Model (LLM), this enormous source of information can be analyzed quantitatively, which can lead to new insights, for example in treatment development or in patterns of patient recovery. However, the language used in clinical notes is highly idiosyncratic, and available generic LLMs have not encountered it in their pre-training. They have therefore not internalized an adequate representation of the semantics of this data, which is essential for building reliable Natural Language Processing (NLP) software. This article describes the development of the first domain-specific LLM for Dutch EHRs: MedRoBERTa.nl. We discuss in detail why and how we built our model, pre-training it on the notes in EHRs using different strategies, and how we were able to release it publicly by thoroughly anonymizing it. We evaluate our model extensively, comparing it to various other LLMs. We also illustrate how our model can be used, discussing various studies that built medical text mining technology on top of our model.
Source journal
Artificial Intelligence in Medicine (Engineering and Technology: Biomedical Engineering)
CiteScore: 15.00
Self-citation rate: 2.70%
Articles published: 143
Review time: 6.3 months
Journal description: Artificial Intelligence in Medicine publishes original articles from a wide variety of interdisciplinary perspectives concerning the theory and practice of artificial intelligence (AI) in medicine, medically-oriented human biology, and health care. Artificial intelligence in medicine may be characterized as the scientific discipline pertaining to research studies, projects, and applications that aim at supporting decision-based medical tasks through knowledge- and/or data-intensive computer-based solutions that ultimately support and improve the performance of a human care provider.