Toward Cross-Hospital Deployment of Natural Language Processing Systems: Model Development and Validation of Fine-Tuned Large Language Models for Disease Name Recognition in Japanese.

Impact factor 3.1 · CAS Medicine Region 3 · JCR Q2 (Medical Informatics)
Seiji Shimizu, Tomohiro Nishiyama, Hiroyuki Nagai, Shoko Wakamiya, Eiji Aramaki
DOI: 10.2196/76773 · JMIR Medical Informatics, vol. 13, e76773 · Published 2025-07-08 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12262928/pdf/
Citations: 0

Abstract

Background: Disease name recognition is a fundamental task in clinical natural language processing, enabling the extraction of critical patient information from electronic health records. While recent advances in large language models (LLMs) have shown promise, most evaluations have focused on English, and little is known about their robustness in low-resource languages such as Japanese. In particular, whether these models can perform reliably on previously unseen in-hospital data, which differs from training data in writing styles and clinical contexts, has not been thoroughly investigated.

Objective: This study evaluated the robustness of fine-tuned LLMs for disease name recognition in Japanese clinical notes, with a particular focus on their performance on in-hospital data that was not included during training.

Methods: We used two corpora for this study: (1) a publicly available set of Japanese case reports, denoted CR, and (2) a newly constructed corpus of progress notes, denoted PN, written by ten physicians to capture the stylistic variation of in-hospital clinical notes. To reflect real-world deployment scenarios, we first fine-tuned models on CR; specifically, we compared an LLM against a baseline masked language model (MLM). These models were then evaluated under two conditions: (1) on CR, representing the in-domain (ID) setting with the same document type as the training data, and (2) on PN, representing the out-of-domain (OOD) setting with a different document type. Robustness was assessed by calculating the performance gap (ie, the performance drop from the ID to the OOD setting).
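The evaluation described above scores disease-name recognition per corpus. The abstract does not specify the matching criterion, so the following is a minimal illustrative sketch, assuming entity-level exact span matching (the convention commonly used for named entity recognition); the function name and tuple format are hypothetical.

```python
def entity_f1(gold, pred):
    """Entity-level precision, recall, and F1 over predicted disease-name spans.

    gold, pred: sets of (start, end, label) tuples for one corpus.
    A prediction counts as correct only if the span and label match exactly.
    """
    tp = len(gold & pred)          # spans found in both gold and prediction
    fp = len(pred - gold)          # predicted spans not in gold
    fn = len(gold - pred)          # gold spans the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

Running this once on CR predictions and once on PN predictions yields the ID and OOD F1-scores whose difference defines the performance gap.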

Results: The LLM demonstrated greater robustness, with a smaller F1-score performance gap (ID-OOD = -8.6) than the MLM baseline (ID-OOD = -13.9). This indicates more stable performance across the ID and OOD settings, highlighting the effectiveness of fine-tuned LLMs for reliable use in diverse clinical environments.
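The gap metric above is simple arithmetic: the OOD F1 minus the ID F1, so a smaller (less negative) value means a smaller drop. The sketch below uses the reported gaps; the absolute ID F1-scores (80.0) are hypothetical placeholders, since the abstract reports only the gaps.

```python
def performance_gap(f1_id, f1_ood):
    """ID-OOD performance gap; negative values mean a drop on OOD data."""
    return f1_ood - f1_id

# Hypothetical ID scores chosen to reproduce the reported gaps:
llm_gap = performance_gap(80.0, 71.4)   # ~ -8.6  (LLM)
mlm_gap = performance_gap(80.0, 66.1)   # ~ -13.9 (MLM baseline)

# A smaller drop from ID to OOD indicates greater robustness.
assert llm_gap > mlm_gap
```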

Conclusions: Fine-tuned LLMs demonstrate superior robustness for disease name recognition in Japanese clinical notes, with a smaller ID-OOD performance gap than the MLM baseline. These findings highlight the potential of LLMs as reliable tools for clinical natural language processing in low-resource language settings and support their deployment in real-world health care applications, where diversity in documentation is inevitable.

Source journal: JMIR Medical Informatics (Medicine - Health Informatics)
CiteScore: 7.90
Self-citation rate: 3.10%
Annual articles: 173
Review time: 12 weeks
Journal description: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (emphasizing more on applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.