The DRAGON benchmark for clinical NLP
Joeran S. Bosma, Koen Dercksen, Luc Builtjes, Romain André, Christian Roest, Stefan J. Fransen, Constant R. Noordman, Mar Navarro-Padilla, Judith Lefkes, Natália Alves, Max J. J. de Grauw, Leander van Eekelen, Joey M. A. Spronck, Megan Schuurmans, Bram de Wilde, Ward Hendrix, Witali Aswolinskiy, Anindo Saha, Jasper J. Twilt, Daan Geijs, Jeroen Veltman, Derya Yakar, Maarten de Rooij, Francesco Ciompi, Alessa Hering, Jeroen Geerdink, Henkjan Huisman
npj Digital Medicine. Published 2025-05-17. DOI: 10.1038/s41746-025-01626-x (https://doi.org/10.1038/s41746-025-01626-x)
Abstract
Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to facilitate algorithm development but remains underexplored due to a lack of public benchmarks. This study introduces the DRAGON challenge, a benchmark for clinical NLP with 28 tasks and 28,824 annotated medical reports from five Dutch care centers. It facilitates automated, large-scale, cost-effective data annotation. Foundational LLMs were pretrained using four million clinical reports from a sixth Dutch care center. Evaluations showed the superiority of domain-specific pretraining (DRAGON 2025 test score of 0.770) and mixed-domain pretraining (0.756), compared to general-domain pretraining (0.734, p < 0.005). While strong performance was achieved on 18/28 tasks, performance was subpar on 10/28 tasks, uncovering where innovations are needed. Benchmark, code, and foundational LLMs are publicly available.
About the journal:
npj Digital Medicine is an online open-access journal that focuses on publishing peer-reviewed research in the field of digital medicine. The journal covers various aspects of digital medicine, including the application and implementation of digital and mobile technologies in clinical settings, virtual healthcare, and the use of artificial intelligence and informatics.
The primary goal of the journal is to support innovation and the advancement of healthcare through the integration of new digital and mobile technologies. When determining if a manuscript is suitable for publication, the journal considers four important criteria: novelty, clinical relevance, scientific rigor, and digital innovation.