Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases.

IF 10.8 | Zone 1 (Medicine) | Q1 MEDICINE, RESEARCH & EXPERIMENTAL
Leonardo Chimirri, J Harry Caufield, Yasemin Bridges, Nicolas Matentzoglu, Michael Gargano, Mario Cazalla, Shihan Chen, Daniel Danis, Alexander J M Dingemans, Klara Gehle, Petra Gehle, Adam S L Graefe, Weihong Gu, Markus S Ladewig, Pablo Lapunzina, Julián Nevado, Enock Niyonkuru, Soichi Ogishima, Dominik Seelow, Jair A Tenorio Castaño, Marek Turnovec, Bert B A de Vries, Kai Wang, Kyran Wissink, Zafer Yüksel, Gabriele Zucca, Melissa A Haendel, Christopher J Mungall, Justin Reese, Peter N Robinson
EBioMedicine, volume 121, article 105957. Published 2025-10-14. Publication type: Journal Article. DOI: 10.1016/j.ebiom.2025.105957 (https://doi.org/10.1016/j.ebiom.2025.105957)
Citations: 0

Abstract

Background: Large language models (LLMs) are increasingly used in medicine for diverse applications, including differential diagnostic support. The training data used to create LLMs such as the Generative Pretrained Transformer (GPT) consist predominantly of English-language texts, but LLMs could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models in a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases is lacking.

Methods: We created 4917 clinical vignettes from structured data captured with Human Phenotype Ontology (HPO) terms and the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span a total of 360 distinct genetic diseases with 2525 associated phenotypic features. We used translations of the Human Phenotype Ontology together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, French, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o, version gpt-4o-2024-08-06, and the medically fine-tuned Meditron3-70B to the task of delivering a ranked differential diagnosis using a zero-shot prompt. To automate evaluation of LLM responses, an ontology-based approach with the Mondo disease ontology was used to map synonyms and to map disease subtypes to clinical diagnoses.
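The prompt-generation step described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the template wording, the `HPO_LABELS` and `TEMPLATES` tables, and the `build_prompt` helper are all hypothetical, and the German labels are illustrative rather than taken from the official HPO translation files.

```python
from string import Template

# Hypothetical label tables: HPO term id -> label, per language.
# The German labels are illustrative, not the official HPO translations.
HPO_LABELS = {
    "en": {
        "HP:0001250": "Seizure",
        "HP:0001263": "Global developmental delay",
    },
    "de": {
        "HP:0001250": "Krampfanfall",
        "HP:0001263": "Globale Entwicklungsverzögerung",
    },
}

# Hypothetical language-specific templates; the wording used in the study
# is not given in the abstract.
TEMPLATES = {
    "en": Template(
        "The patient presented with the following features: $features. "
        "Provide a ranked differential diagnosis."
    ),
    "de": Template(
        "Der Patient zeigte die folgenden Merkmale: $features. "
        "Erstellen Sie eine nach Rang geordnete Differentialdiagnose."
    ),
}


def build_prompt(hpo_ids, lang):
    """Render a zero-shot prompt from HPO term ids in the given language."""
    labels = [HPO_LABELS[lang][hpo_id] for hpo_id in hpo_ids]
    return TEMPLATES[lang].substitute(features=", ".join(labels))


if __name__ == "__main__":
    case = ["HP:0001250", "HP:0001263"]
    for lang in ("en", "de"):
        print(build_prompt(case, lang))
```

Keeping the phenotype data as language-neutral term ids and translating only at render time means the same structured case yields comparable prompts in every language, which is what makes a cross-language comparison meaningful.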

Findings: For English, GPT-4o placed the correct diagnosis at rank 1 in 19.9% of cases and within the top 3 ranks in 27.0% of cases. For the nine non-English languages tested here, the correct diagnosis was placed at rank 1 in 16.9% to 20.6% of cases and within the top 3 ranks in 25.4% to 28.6% of cases. The Meditron3 model placed the correct diagnosis within the top 3 ranks for 20.9% of cases in English and for 19.9% to 24.0% of cases in the other nine languages.
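Rank-1 and top-3 rates of this kind can be computed from ranked model responses once each response has been mapped to a canonical disease identifier. The sketch below uses toy data only: the `TOY:` identifiers, the `CANONICAL` table standing in for an ontology-based synonym and subtype mapping, and both helper functions are assumptions for illustration, not the study's evaluation code or real Mondo identifiers.

```python
# Hypothetical mapping from synonyms or subtypes to a canonical diagnosis id,
# standing in for an ontology-based lookup. Ids are toy values.
CANONICAL = {"TOY:1a": "TOY:1", "TOY:1b": "TOY:1"}


def rank_of_correct(ranked, correct, canonical=CANONICAL):
    """Return the 1-based rank of the first response that maps to the
    correct diagnosis, or None if it does not appear in the list."""
    target = canonical.get(correct, correct)
    for position, diagnosis in enumerate(ranked, start=1):
        if canonical.get(diagnosis, diagnosis) == target:
            return position
    return None


def top_k_rate(ranks, k):
    """Fraction of cases whose correct diagnosis is within the top k ranks."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


# Toy cases: (ranked model output, correct diagnosis).
cases = [
    (["TOY:1a", "TOY:2"], "TOY:1"),           # subtype counts as a hit, rank 1
    (["TOY:3", "TOY:2", "TOY:4"], "TOY:4"),   # rank 3
    (["TOY:2"], "TOY:5"),                     # not found
    (["TOY:6", "TOY:5"], "TOY:5"),            # rank 2
]
ranks = [rank_of_correct(ranked, correct) for ranked, correct in cases]

if __name__ == "__main__":
    print("rank-1 rate:", top_k_rate(ranks, 1))
    print("top-3 rate:", top_k_rate(ranks, 3))
```

Counting a subtype or synonym as a match for its canonical diagnosis avoids penalizing a response that names the right disease under a different label.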

Interpretation: The differential diagnostic performance of LLMs across a comprehensive corpus of rare-disease cases was largely consistent across the ten languages tested. This suggests that the clinical utility of LLMs may extend to non-English settings.

Funding: NHGRI 5U24HG011449, 5RM1HG010860, R01HD103805 and R24OD011883. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER). C.M., J.R. and J.H.C. were supported in part by the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy (Contract No. DE-AC0205CH11231).

Source journal
EBioMedicine
Subject area: Biochemistry, Genetics and Molecular Biology (General)
CiteScore: 17.70
Self-citation rate: 0.90%
Articles published: 579
Review time: 5 weeks
About the journal: eBioMedicine is a comprehensive biomedical research journal that covers a wide range of studies relevant to human health. Its focus is on original research that explores the fundamental factors influencing human health and disease, including the discovery of new therapeutic targets and treatments, the identification of biomarkers and diagnostic tools, and the investigation and modification of disease pathways and mechanisms. The journal welcomes studies from any biomedical discipline that contribute to the understanding of disease and aim to improve human health.