{"title":"大型语言模型在儿科肾脏病临床决策支持中的性能评价:一项综合评估。","authors":"Olivier Niel, Dishana Dookhun, Ancuta Caliment","doi":"10.1007/s00467-025-06819-w","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.</p><p><strong>Methods: </strong>Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts based on international guidelines. Each case comprised questions addressing diagnosis, biological/imaging explorations, treatments, and logic. Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned using retrieval-augmented generation with validated pediatric nephrology materials. Performance was evaluated based on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.</p><p><strong>Results: </strong>Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. 
Performance variability was minimal, with higher performing models demonstrating greater consistency.</p><p><strong>Conclusions: </strong>While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.</p>","PeriodicalId":19735,"journal":{"name":"Pediatric Nephrology","volume":" ","pages":"3211-3218"},"PeriodicalIF":2.6000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance evaluation of large language models in pediatric nephrology clinical decision support: a comprehensive assessment.\",\"authors\":\"Olivier Niel, Dishana Dookhun, Ancuta Caliment\",\"doi\":\"10.1007/s00467-025-06819-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.</p><p><strong>Methods: </strong>Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts based on international guidelines. Each case comprised questions addressing diagnosis, biological/imaging explorations, treatments, and logic. 
Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned using retrieval-augmented generation with validated pediatric nephrology materials. Performance was evaluated based on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.</p><p><strong>Results: </strong>Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. Performance variability was minimal, with higher performing models demonstrating greater consistency.</p><p><strong>Conclusions: </strong>While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. 
Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.</p>\",\"PeriodicalId\":19735,\"journal\":{\"name\":\"Pediatric Nephrology\",\"volume\":\" \",\"pages\":\"3211-3218\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pediatric Nephrology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1007/s00467-025-06819-w\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/6/3 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"PEDIATRICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pediatric Nephrology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00467-025-06819-w","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/6/3 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"PEDIATRICS","Score":null,"Total":0}
Performance evaluation of large language models in pediatric nephrology clinical decision support: a comprehensive assessment.
Background: Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.
Methods: Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts based on international guidelines. Each case comprised questions addressing diagnosis, laboratory and imaging investigations, treatments, and clinical reasoning. Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned using retrieval-augmented generation with validated pediatric nephrology materials. Performance was evaluated based on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.
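The abstract does not include the grading protocol itself, but the rubric it describes (per-question accuracy and personalization, plus counts of contradictions, hallucinations, and dangerous decisions) can be sketched as a simple aggregation. The following is a purely illustrative sketch, not the authors' code; the `Grade` record, model names, and toy scores are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Grade:
    """Hypothetical grading record for one model's answer to one case question,
    following the study's stated criteria."""
    accurate: bool
    personalized: bool
    contradictions: int = 0
    hallucinations: int = 0
    dangerous: int = 0

def summarize(grades):
    """Aggregate per-question grades into study-style summary metrics:
    percentages for accuracy/personalization, raw counts for the rest."""
    return {
        "accuracy_pct": 100 * mean(g.accurate for g in grades),
        "personalization_pct": 100 * mean(g.personalized for g in grades),
        "hallucinations": sum(g.hallucinations for g in grades),
        "dangerous_decisions": sum(g.dangerous for g in grades),
    }

# Toy data for two hypothetical models, three questions each.
scores = {
    "model_a": [Grade(True, True), Grade(True, True),
                Grade(False, True, hallucinations=1)],
    "model_b": [Grade(True, False), Grade(False, False, dangerous=1),
                Grade(False, True)],
}
summary = {name: summarize(gs) for name, gs in scores.items()}
```

Expressing the rubric this way makes the reported figures (e.g. an accuracy percentage alongside a hallucination count) directly comparable across models.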
Results: Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. Performance variability was minimal, with higher-performing models demonstrating greater consistency.
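The abstract reports a significance result (p = 0.01 for Claude versus the other models) without naming the test used. One distribution-free way to compare two models' per-question correctness is a permutation test on the difference of mean scores; the sketch below is illustrative only, with made-up data, and does not claim to reproduce the study's statistics.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test on mean(a) - mean(b).

    Repeatedly shuffles the pooled scores, reassigns them to two groups of
    the original sizes, and counts how often the shuffled mean difference
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_iter

# Toy per-question correctness (1 = correct) for a strong vs. a weak model.
strong = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # 90% accurate
weak   = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 40% accurate
p = permutation_test(strong, weak)
```

With only ten questions per model, even a 50-point accuracy gap yields a p-value near the conventional 0.05 threshold, which illustrates why per-criterion question counts matter when interpreting such comparisons.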
Conclusions: While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.
About the journal:
International Pediatric Nephrology Association
Pediatric Nephrology publishes original clinical research related to acute and chronic diseases that affect renal function, blood pressure, and fluid and electrolyte disorders in children. Studies may involve medical, surgical, nutritional, physiologic, biochemical, genetic, pathologic or immunologic aspects of disease, imaging techniques or consequences of acute or chronic kidney disease. There are 12 issues per year that contain Editorial Commentaries, Reviews, Educational Reviews, Original Articles, Brief Reports, Rapid Communications, Clinical Quizzes, and Letters to the Editors.