Large language models for accurate disease detection in electronic health records

N. Bürgisser, MD Etienne Chalot, S. Mehouachi, MSc Clement P. Buclin, Kim Lauper, PhD Delphine S. Courvoisier, PhD Denis Mongin
medRxiv, 2024-07-29. DOI: 10.1101/2024.07.27.24311106 (https://doi.org/10.1101/2024.07.27.24311106)

Abstract

Importance: The use of large language models (LLMs) in medicine is increasing, with potential applications in electronic health records (EHRs) to create patient cohorts or identify patients who meet clinical trial recruitment criteria. However, significant barriers remain, including the extensive computing resources required, the lack of performance evaluation, and challenges in implementation.
Objective: To propose and test a framework for detecting disease diagnoses using a recent lightweight LLM on French-language EHR documents. Specifically, the study focuses on detecting gout ("goutte" in French), a ubiquitous French term that has multiple meanings beyond the disease. The study compares the performance of the LLM-based framework with traditional natural language processing techniques and tests its dependence on the parameters used.
Design: The framework was developed using a training and testing set of 700 paragraphs assessing goutte, drawn from a random selection of retrospective EHR documents. All paragraphs were manually reviewed and classified by two health-care professionals (HCPs) as disease (true gout) or non-disease; this classification served as the gold standard. The LLM's accuracy was tested using few-shot and chain-of-thought prompting and compared to a regular expression (regex)-based method, focusing on the effects of model parameters and prompt structure. The framework was further validated on 600 paragraphs assessing calcium pyrophosphate deposition disease (CPPD).
Setting: The documents were sampled from the electronic health records of a tertiary university hospital in Geneva, Switzerland.
Participants: Adults over 18 years of age.
Exposure: Meta's Llama 3 8B LLM or the traditional method, against a gold standard.
Main Outcomes and Measures: Positive and negative predictive value, as well as accuracy, of the tested models.
Results: The LLM-based algorithm outperformed the regex method, achieving a positive predictive value of 92.7% [88.7-95.4%], a negative predictive value of 96.6% [94.6-97.8%], and an accuracy of 95.4% [93.6-96.7%] for gout. In the validation set on CPPD, accuracy was 94.1% [90.2-97.6%]. The LLM framework performed well over a wide range of parameter values.
Conclusions and Relevance: LLMs were able to accurately detect disease diagnoses from EHRs, even in non-English languages. They could facilitate the creation of large disease registries in any language, improving disease care assessment and patient recruitment for clinical trials.
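To make the outcome measures concrete, the following is a minimal sketch of how a regex-style baseline could be scored against a gold standard using positive predictive value, negative predictive value, and accuracy. It is not the authors' code: the pattern `DISEASE_PATTERN`, the helper names, and the four toy paragraphs are hypothetical, and the Wilson score interval is an assumption, since the abstract does not say how the bracketed intervals were computed.

```python
import math
import re

# Hypothetical baseline pattern (NOT the authors' regex): flags "goutte" only
# in a few explicitly clinical phrasings, to avoid matching "drop" usages.
DISEASE_PATTERN = re.compile(
    r"\b(?:crise|arthrite|acc[eè]s)\s+de\s+goutte\b"
    r"|\bgoutte\s+(?:tophac[ée]e|chronique)\b",
    re.IGNORECASE,
)


def regex_predict(paragraph: str) -> bool:
    """Return True if the paragraph matches the hypothetical disease pattern."""
    return DISEASE_PATTERN.search(paragraph) is not None


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (assumed CI method)."""
    if total == 0:
        return (float("nan"), float("nan"))
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return (max(0.0, center - half), min(1.0, center + half))


def proportion(successes: int, total: int):
    """Point estimate plus Wilson interval; NaN estimate when undefined."""
    estimate = successes / total if total else float("nan")
    return (estimate, wilson_interval(successes, total))


def evaluate(y_true, y_pred):
    """Compute PPV, NPV, and accuracy from gold labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "ppv": proportion(tp, tp + fp),        # true positives / predicted positives
        "npv": proportion(tn, tn + fn),        # true negatives / predicted negatives
        "accuracy": proportion(tp + tn, len(pairs)),
    }


# Toy paragraphs with made-up gold labels, for illustration only.
paragraphs = [
    "Crise de goutte du gros orteil droit.",
    "Collyre : 1 goutte dans chaque œil matin et soir.",
    "Goutte tophacée connue, sous allopurinol.",
    "Antécédent de goutte.",
]
gold = [True, False, True, True]

predictions = [regex_predict(p) for p in paragraphs]
metrics = evaluate(gold, predictions)
for name, (estimate, (low, high)) in metrics.items():
    print(f"{name}: {estimate:.3f} [{low:.3f}-{high:.3f}]")
```

The last toy paragraph is deliberately one the narrow pattern misses, which lowers the negative predictive value. The same `evaluate` function could score the output of an LLM-based classifier by passing its predictions in place of `predictions`.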