{"title":"大型语言模型在儿科肾脏病临床决策支持中的性能评价:一项综合评估。","authors":"Olivier Niel, Dishana Dookhun, Ancuta Caliment","doi":"10.1007/s00467-025-06819-w","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.</p><p><strong>Methods: </strong>Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts based on international guidelines. Each case comprised questions addressing diagnosis, biological/imaging explorations, treatments, and logic. Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned using retrieval-augmented generation with validated pediatric nephrology materials. Performance was evaluated based on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.</p><p><strong>Results: </strong>Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. 
Performance variability was minimal, with higher performing models demonstrating greater consistency.</p><p><strong>Conclusions: </strong>While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.</p>","PeriodicalId":19735,"journal":{"name":"Pediatric Nephrology","volume":" ","pages":"3211-3218"},"PeriodicalIF":2.6000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance evaluation of large language models in pediatric nephrology clinical decision support: a comprehensive assessment.\",\"authors\":\"Olivier Niel, Dishana Dookhun, Ancuta Caliment\",\"doi\":\"10.1007/s00467-025-06819-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.</p><p><strong>Methods: </strong>Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts based on international guidelines. Each case comprised questions addressing diagnosis, biological/imaging explorations, treatments, and logic. 
Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned using retrieval-augmented generation with validated pediatric nephrology materials. Performance was evaluated based on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.</p><p><strong>Results: </strong>Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. Performance variability was minimal, with higher performing models demonstrating greater consistency.</p><p><strong>Conclusions: </strong>While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. 
Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.</p>\",\"PeriodicalId\":19735,\"journal\":{\"name\":\"Pediatric Nephrology\",\"volume\":\" \",\"pages\":\"3211-3218\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pediatric Nephrology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1007/s00467-025-06819-w\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/6/3 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"PEDIATRICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pediatric Nephrology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00467-025-06819-w","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/6/3 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"PEDIATRICS","Score":null,"Total":0}
Performance evaluation of large language models in pediatric nephrology clinical decision support: a comprehensive assessment.
Background: Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.
Methods: Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts based on international guidelines. Each case comprised questions addressing diagnosis, laboratory and imaging investigations, treatments, and clinical reasoning. Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned using retrieval-augmented generation with validated pediatric nephrology materials. Performance was evaluated based on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.
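The abstract does not include the grading protocol itself, but the rubric it describes (per-question accuracy and personalization, plus counts of contradictions, hallucinations, and dangerous decisions) can be sketched as a simple aggregation. The following is a purely illustrative sketch, not the authors' code; the `Grade` record, model names, and toy scores are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Grade:
    """Hypothetical grading record for one model's answer to one case question,
    following the study's stated criteria."""
    accurate: bool
    personalized: bool
    contradictions: int = 0
    hallucinations: int = 0
    dangerous: int = 0

def summarize(grades):
    """Aggregate per-question grades into study-style summary metrics:
    percentages for accuracy/personalization, raw counts for the rest."""
    return {
        "accuracy_pct": 100 * mean(g.accurate for g in grades),
        "personalization_pct": 100 * mean(g.personalized for g in grades),
        "hallucinations": sum(g.hallucinations for g in grades),
        "dangerous_decisions": sum(g.dangerous for g in grades),
    }

# Toy data for two hypothetical models, three questions each.
scores = {
    "model_a": [Grade(True, True), Grade(True, True),
                Grade(False, True, hallucinations=1)],
    "model_b": [Grade(True, False), Grade(False, False, dangerous=1),
                Grade(False, True)],
}
summary = {name: summarize(gs) for name, gs in scores.items()}
```

Expressing the rubric this way makes the reported figures (e.g. an accuracy percentage alongside a hallucination count) directly comparable across models.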
Results: Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. Performance variability was minimal, with higher-performing models demonstrating greater consistency.
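The abstract reports a significance result (p = 0.01 for Claude versus the other models) without naming the test used. One distribution-free way to compare two models' per-question correctness is a permutation test on the difference of mean scores; the sketch below is illustrative only, with made-up data, and does not claim to reproduce the study's statistics.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test on mean(a) - mean(b).

    Repeatedly shuffles the pooled scores, reassigns them to two groups of
    the original sizes, and counts how often the shuffled mean difference
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_iter

# Toy per-question correctness (1 = correct) for a strong vs. a weak model.
strong = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # 90% accurate
weak   = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 40% accurate
p = permutation_test(strong, weak)
```

With only ten questions per model, even a 50-point accuracy gap yields a p-value near the conventional 0.05 threshold, which illustrates why per-criterion question counts matter when interpreting such comparisons.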
Conclusions: While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.
About the journal:
International Pediatric Nephrology Association
Pediatric Nephrology publishes original clinical research related to acute and chronic diseases that affect renal function, blood pressure, and fluid and electrolyte disorders in children. Studies may involve medical, surgical, nutritional, physiologic, biochemical, genetic, pathologic or immunologic aspects of disease, imaging techniques or consequences of acute or chronic kidney disease. There are 12 issues per year that contain Editorial Commentaries, Reviews, Educational Reviews, Original Articles, Brief Reports, Rapid Communications, Clinical Quizzes, and Letters to the Editors.