Accuracy of Large Language Models to Identify Stroke Subtypes Within Unstructured Electronic Health Record Data.

IF 8.9 | Zone 1, Medicine | JCR Q1, Clinical Neurology

Stroke. Pub Date: 2025-10-01. Epub Date: 2025-07-25. DOI: 10.1161/STROKEAHA.125.051993

Dylan Owens, Danh Q Nguyen, Michael Dohopolski, Justin F Rousseau, Eric D Peterson, Ann Marie Navar

Pages: 2966-2975. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12313299/pdf/


Abstract

Background: While International Classification of Diseases, Tenth Revision codes suffice for identifying stroke events in surveillance, accurately classifying stroke types and subtypes using electronic health records remains challenging due to limitations in structured data. This often necessitates manual review of clinical documentation. This study evaluated whether a large language model, Generative Pre-Trained Transformer 4 Omni (GPT-4o), can accurately identify stroke types and subtypes from unstructured clinical notes.

Methods: We implemented a retrieval-augmented generation framework with GPT-4o to classify stroke types (ischemic versus hemorrhagic) and ischemic stroke subtypes using electronic health record data. The American Heart Association Get With The Guidelines-Stroke registry served as the gold standard. Model development used a 20% subset of Get With The Guidelines-Stroke-linked data from UT Southwestern Medical Center (UTSW), with the remaining 80% reserved for testing. External validation used data from the Parkland Health and Hospital System (PHHS). A total of 4123 stroke hospitalizations from January 2019 to August 2023 were included (UTSW: n=2047; PHHS: n=2076). Three prompting strategies (zero-shot chain-of-thought, expert-guided, and instruction-based) were evaluated. GPT-4o's predictions were compared with classifications made by trained abstractors contributing to the Get With The Guidelines-Stroke registry.
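The pipeline described in the Methods can be illustrated with a minimal sketch. The abstract does not give the actual retrieval method, prompts, or split procedure, so everything below is hypothetical: the random 20/80 split, the keyword-overlap retrieval, the prompt wording, and the function names `split_dev_test`, `retrieve`, and `build_zero_shot_cot_prompt` are illustrative assumptions, not the authors' implementation. No model call is made; the sketch stops at the assembled prompt.

```python
import random


def split_dev_test(ids, dev_fraction=0.2, seed=0):
    """Shuffle hospitalization IDs and return (development, test) lists.

    Assumption: a simple random split; the abstract only states 20% / 80%.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[:n_dev], shuffled[n_dev:]


def retrieve(snippets, query_terms, k=3):
    """Toy retrieval step: rank note snippets by how many query terms they contain.

    Assumption: a real retrieval-augmented setup would likely use embeddings;
    keyword overlap is used here only to keep the sketch dependency-free.
    """
    def score(text):
        lowered = text.lower()
        return sum(term in lowered for term in query_terms)

    return sorted(snippets, key=score, reverse=True)[:k]


def build_zero_shot_cot_prompt(snippets):
    """Assemble a zero-shot chain-of-thought prompt (hypothetical wording)."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "You are reviewing clinical notes from a stroke hospitalization.\n"
        f"Relevant excerpts:\n{context}\n\n"
        "Question: Is the stroke ischemic or hemorrhagic?\n"
        "Let's think step by step, then give the final answer on its own line."
    )


# 2047 is the UTSW hospitalization count reported in the abstract.
dev_ids, test_ids = split_dev_test(range(2047))

# Made-up note snippets for illustration only.
notes = [
    "Patient ambulating with walker.",
    "MRI brain: acute infarct in the left MCA territory.",
    "CT head negative for hemorrhage.",
]
top = retrieve(notes, ["infarct", "hemorrhage", "mca"], k=2)
prompt = build_zero_shot_cot_prompt(top)
print(len(dev_ids), len(test_ids))
print(prompt)
```

The zero-shot chain-of-thought variant needs no labeled examples or expert-written rules, which is why the abstract describes it as requiring minimal human input; the expert-guided and instruction-based strategies would differ only in the text passed to the prompt builder.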

Results: In the external validation set, 79.6% of patients had ischemic stroke and 20.4% had hemorrhagic stroke. GPT-4o achieved an accuracy of 0.98 (95% CI, 0.97-0.99) in classifying stroke type, where accuracy reflects the overall proportion of correctly classified patients. Sensitivity was 0.98 (95% CI, 0.97-0.99), and specificity was 0.97 (95% CI, 0.96-0.98). For ischemic stroke subtypes, sensitivity ranged from 0.40 (95% CI, 0.31-0.49) for cryptogenic to 0.95 (95% CI, 0.93-0.97) for small-vessel occlusion. Specificity ranged from 0.94 (95% CI, 0.92-0.96) for large-artery atherosclerosis to 0.98 (95% CI, 0.97-0.99) for cardioembolism. Zero-shot chain-of-thought prompting, which requires minimal human input, performed comparably to more labor-intensive strategies. Consistency analysis revealed >99% agreement across repeated queries.
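As a rough illustration of how such figures are computed, the sketch below derives accuracy, sensitivity, and specificity for one class against the rest, with a 95% interval for a proportion. The abstract does not say which interval method was used; the Wilson score interval here is an assumption, and the labels are toy data, not study results. The names `wilson_interval` and `one_vs_rest_metrics` are hypothetical.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Treat `positive` as the target class and every other label as negative."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined when the class is absent from the reference labels.
        return num / den if den else float("nan")

    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
    }


# Toy labels: 8 ischemic and 2 hemorrhagic cases, with one ischemic case missed.
y_true = ["ischemic"] * 8 + ["hemorrhagic"] * 2
y_pred = ["ischemic"] * 7 + ["hemorrhagic"] * 3

m = one_vs_rest_metrics(y_true, y_pred, positive="ischemic")
low, high = wilson_interval(m["tp"], m["tp"] + m["fn"])
print(m)
print(f"sensitivity 95% CI: {low:.2f}-{high:.2f}")
```

The same one-vs-rest function applies to ischemic subtypes by passing the subtype as `positive`. A frequent class can have high sensitivity while a rarer, harder one such as cryptogenic stays low, even when overall accuracy looks strong.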

Conclusions: GPT-4o demonstrated strong accuracy in classifying stroke types but faced challenges with ischemic subtypes.


Source journal

Stroke (Medicine: Clinical Neurology)

CiteScore: 13.40

Self-citation rate: 6.00%

Articles published: 2021

Review time: 3 months

Journal description: Stroke is a monthly publication that collates reports of clinical and basic investigation of any aspect of the cerebral circulation and its diseases. The publication covers a wide range of disciplines including anesthesiology, critical care medicine, epidemiology, internal medicine, neurology, neuro-ophthalmology, neuropathology, neuropsychology, neurosurgery, nuclear medicine, nursing, radiology, rehabilitation, speech pathology, vascular physiology, and vascular surgery. The audience of Stroke includes neurologists, basic scientists, cardiologists, vascular surgeons, internists, interventionalists, neurosurgeons, nurses, and physiatrists. Stroke is indexed in Biological Abstracts, BIOSIS, CAB Abstracts, Chemical Abstracts, CINAHL, Current Contents, Embase, MEDLINE, and Science Citation Index Expanded.