Large Language Models in Neurological Practice: Real-World Study
Natale Vincenzo Maiorana, Sara Marceglia, Mauro Treddenti, Mattia Tosi, Matteo Guidetti, Maria Francesca Creta, Tommaso Bocci, Serena Oliveri, Filippo Martinelli Boneschi, Alberto Priori
Journal of Medical Internet Research, 2025;27:e73212. Published September 22, 2025. DOI: 10.2196/73212. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453287/pdf/
Citations: 0
Abstract
Background: Large language models (LLMs) such as ChatGPT (OpenAI) and Gemini (Google) are increasingly explored for their potential in medical diagnostics, including neurology. Their real-world applicability remains inadequately assessed, particularly in clinical workflows where nuanced decision-making is required.
Objective: This study aims to evaluate the diagnostic accuracy and the appropriateness of clinical recommendations provided by the freely available ChatGPT and Gemini, used without any domain-specific training, compared with neurologists, on real-world clinical cases.
Methods: This study was an experimental evaluation of LLMs' diagnostic performance in which real-world neurology cases were presented to ChatGPT and Gemini and their performance was compared with that of clinical neurologists. The study simulated a first visit using information from anonymized patient records of the Neurology Department of the ASST Santi Paolo e Carlo Hospital, ensuring a real-world clinical context. The cohort comprised 28 anonymized patient cases covering a range of neurological conditions and diagnostic complexities representative of daily clinical practice. The primary outcome was the diagnostic accuracy of both neurologists and LLMs, defined as concordance with the discharge diagnosis. Secondary outcomes included the appropriateness of recommended diagnostic tests, interrater agreement, and the extent of additional prompting required to obtain accurate responses.
Results: Neurologists achieved a diagnostic accuracy of 75%, outperforming ChatGPT (54%) and Gemini (46%). Both LLMs showed limitations in nuanced clinical reasoning and overprescribed diagnostic tests in 17%-25% of cases. In addition, complex or ambiguous cases required further prompting to refine the artificial intelligence-generated responses. Interrater reliability analysis using Fleiss kappa showed moderate agreement among raters (κ=0.47, SE 0.077; z=6.14, P<.001), significantly above chance-level agreement.
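For readers who want to reproduce this kind of agreement statistic, the following is a minimal sketch using the fleiss_kappa function from statsmodels. The rating matrix and the binary category coding are illustrative assumptions only; the study's actual rating data are not included in this abstract.

```python
# Minimal sketch: Fleiss' kappa for interrater agreement, as reported above.
# The ratings below are hypothetical, not the study's data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = cases, columns = raters,
# values = category assigned by each rater (e.g., 0 = incorrect, 1 = correct).
ratings = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])

# Convert to a cases-by-categories count table, then compute Fleiss' kappa.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.2f}")

# The z statistic reported in the abstract is kappa divided by its standard
# error (0.47 / 0.077 ≈ 6.1), tested against the null hypothesis of
# chance-level agreement.
```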
Conclusions: While LLMs show potential as supportive tools in neurology, freely available models used without prior domain-specific training currently lack the depth required for independent clinical decision-making. The moderate agreement observed among human raters underscores the variability inherent even in expert judgment and highlights the importance of rigorous validation when integrating artificial intelligence tools into clinical workflows. Future research should focus on refining LLM capabilities and developing evaluation methodologies that reflect the complexities of real-world neurological practice, ensuring the effective, responsible, and safe use of these promising technologies.
About the journal
The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades.
As a leader in the industry, the journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor.
Notably, JMIR holds the prestigious position of being ranked #1 on Google Scholar within the "Medical Informatics" discipline.