Large Language Models in Neurological Practice: Real-World Study
Natale Vincenzo Maiorana, Sara Marceglia, Mauro Treddenti, Mattia Tosi, Matteo Guidetti, Maria Francesca Creta, Tommaso Bocci, Serena Oliveri, Filippo Martinelli Boneschi, Alberto Priori
Journal of Medical Internet Research, 2025;27:e73212. Published September 22, 2025. DOI: 10.2196/73212. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453287/pdf/
Citations: 0
Abstract
Background: Large language models (LLMs) such as ChatGPT (OpenAI) and Gemini (Google) are increasingly explored for their potential in medical diagnostics, including neurology. Their real-world applicability remains inadequately assessed, particularly in clinical workflows where nuanced decision-making is required.
Objective: This study aims to evaluate the diagnostic accuracy and the appropriateness of clinical recommendations provided by the freely available ChatGPT and Gemini, used without any domain-specific training, compared with neurologists, on real-world clinical cases.
Methods: This study was an experimental evaluation of LLMs' diagnostic performance in which real-world neurology cases were presented to ChatGPT and Gemini and their performance was compared with that of clinical neurologists. The study simulated a first visit using information from anonymized patient records of the Neurology Department of the ASST Santi Paolo e Carlo Hospital, ensuring a real-world clinical context. The cohort comprised 28 anonymized patient cases covering a range of neurological conditions and diagnostic complexities representative of daily clinical practice. The primary outcome was the diagnostic accuracy of both neurologists and LLMs, defined as concordance with the discharge diagnosis. Secondary outcomes included the appropriateness of recommended diagnostic tests, interrater agreement, and the extent of additional prompting required to obtain accurate responses.
Results: Neurologists achieved a diagnostic accuracy of 75%, outperforming ChatGPT (54%) and Gemini (46%). Both LLMs showed limitations in nuanced clinical reasoning and overprescribed diagnostic tests in 17%-25% of cases. In addition, complex or ambiguous cases required further prompting to refine the artificial intelligence-generated responses. Interrater reliability analysis using Fleiss kappa showed moderate agreement among raters (κ=0.47, SE 0.077; z=6.14, P<.001), significantly above chance-level agreement.
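For readers who want to reproduce this kind of agreement statistic, the following is a minimal sketch using the fleiss_kappa function from statsmodels. The rating matrix and the binary category coding are illustrative assumptions only; the study's actual rating data are not included in this abstract.

```python
# Minimal sketch: Fleiss' kappa for interrater agreement, as reported above.
# The ratings below are hypothetical, not the study's data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = cases, columns = raters,
# values = category assigned by each rater (e.g., 0 = incorrect, 1 = correct).
ratings = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])

# Convert to a cases-by-categories count table, then compute Fleiss' kappa.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.2f}")

# The z statistic reported in the abstract is kappa divided by its standard
# error (0.47 / 0.077 ≈ 6.1), tested against the null hypothesis of
# chance-level agreement.
```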
Conclusions: While LLMs show potential as supportive tools in neurology, freely available models used without prior domain-specific training currently lack the depth required for independent clinical decision-making. The moderate agreement observed among human raters underscores the variability inherent even in expert judgment and highlights the importance of rigorous validation when integrating artificial intelligence tools into clinical workflows. Future research should focus on refining LLM capabilities and developing evaluation methodologies that reflect the complexities of real-world neurological practice, ensuring the effective, responsible, and safe use of these promising technologies.
About the journal
The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades.
As a leader in the industry, the journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor.
Notably, JMIR holds the prestigious position of being ranked #1 on Google Scholar within the "Medical Informatics" discipline.