{"title":"Creating, anonymizing and evaluating the first medical language model pre-trained on Dutch Electronic Health Records: MedRoBERTa.nl","authors":"Stella Verkijk, Piek Vossen","doi":"10.1016/j.artmed.2025.103148","DOIUrl":null,"url":null,"abstract":"<div><div>Electronic Health Records (EHRs) contain written notes by all kinds of medical professionals about all aspects of well-being of a patient. When adequately processed with a Large Language Model (LLM), this enormous source of information can be analyzed quantitatively, which can lead to new insights, for example in treatment development or in patterns of patient recovery. However, the language used in clinical notes is very idiosyncratic, which available generic LLMs have not encountered in their pre-training. They therefore have not internalized an adequate representation of the semantics of this data, which is essential for building reliable Natural Language Processing (NLP) software. This article describes the development of the first domain-specific LLM for Dutch EHRs: MedRoBERTa.nl. We discuss in detail why and how we built our model, pre-training it on the notes in EHRs using different strategies, and how we were able to publish it publicly by thoroughly anonymizing it. We evaluate our model extensively, comparing it to various other LLMs. 
We also illustrate how our model can be used, discussing various studies that built medical text mining technology on top of our model.</div></div>","PeriodicalId":55458,"journal":{"name":"Artificial Intelligence in Medicine","volume":"167 ","pages":"Article 103148"},"PeriodicalIF":6.2000,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence in Medicine","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0933365725000831","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Electronic Health Records (EHRs) contain notes written by many kinds of medical professionals about all aspects of a patient's well-being. When adequately processed with a Large Language Model (LLM), this enormous source of information can be analyzed quantitatively, which can lead to new insights, for example in treatment development or in patterns of patient recovery. However, the language used in clinical notes is highly idiosyncratic, and generic LLMs have not encountered it during pre-training. They therefore have not internalized an adequate representation of the semantics of this data, which is essential for building reliable Natural Language Processing (NLP) software. This article describes the development of the first domain-specific LLM for Dutch EHRs: MedRoBERTa.nl. We discuss in detail why and how we built our model, pre-training it on the notes in EHRs using different strategies, and how we were able to publish it publicly by thoroughly anonymizing it. We evaluate our model extensively, comparing it to various other LLMs, and illustrate how it can be used, discussing various studies that built medical text mining technology on top of it.
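The abstract mentions that the model could only be published after thorough anonymization. The paper's actual anonymization procedure is not detailed here; purely as an illustration of the kind of preprocessing involved, below is a minimal rule-based pseudonymization sketch for clinical note text. The regex patterns and placeholder tags are hypothetical examples, not the authors' method.

```python
import re

# Minimal illustrative sketch of rule-based pseudonymization of clinical
# note text. This is NOT the procedure used for MedRoBERTa.nl; the
# patterns and tags below are hypothetical examples.
PATTERNS = [
    (re.compile(r"\b\d{2}-\d{2}-\d{4}\b"), "<DATE>"),          # e.g. 01-02-1980
    (re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"), "<PERSON>"),  # naive two-token name match
    (re.compile(r"\b\d{4}\s?[A-Z]{2}\b"), "<POSTCODE>"),       # Dutch postcode, e.g. 1234 AB
]

def pseudonymize(note: str) -> str:
    """Replace each matched identifier span with its placeholder tag."""
    for pattern, tag in PATTERNS:
        note = pattern.sub(tag, note)
    return note

print(pseudonymize("patient Jan Jansen, geboren 01-02-1980, postcode 1234 AB."))
# → patient <PERSON>, geboren <DATE>, postcode <POSTCODE>.
```

Real de-identification of EHR text requires far more robust methods (dictionary lookups, trained NER models, manual validation); a naive capitalized-bigram rule like the one above both misses names and over-matches ordinary capitalized phrases.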
Journal description:
Artificial Intelligence in Medicine publishes original articles from a wide variety of interdisciplinary perspectives concerning the theory and practice of artificial intelligence (AI) in medicine, medically-oriented human biology, and health care.
Artificial intelligence in medicine may be characterized as the scientific discipline pertaining to research studies, projects, and applications that aim at supporting decision-based medical tasks through knowledge- and/or data-intensive computer-based solutions that ultimately support and improve the performance of a human care provider.