Evaluating Large Language Models in extracting cognitive exam dates and scores.

PLOS Digital Health · Pub Date: 2024-12-11 · eCollection Date: 2024-12-01 · DOI: 10.1371/journal.pdig.0000685
Hao Zhang, Neil Jethani, Simon Jones, Nicholas Genes, Vincent J Major, Ian S Jaffe, Anthony B Cardillo, Noah Heilenbach, Nadia Fazal Ali, Luke J Bonanni, Andrew J Clayburn, Zain Khera, Erica C Sadler, Jaideep Prasad, Jamie Schlacter, Kevin Liu, Benjamin Silva, Sophie Montgomery, Eric J Kim, Jacob Lester, Theodore M Hill, Alba Avoricani, Ethan Chervonski, James Davydov, William Small, Eesha Chakravartty, Himanshu Grover, John A Dodson, Abraham A Brody, Yindalon Aphinyanaphongs, Arjun Masurkar, Narges Razavian
Abstract

Ensuring the reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and for training the reviewers. The remaining 722 were assigned to reviewers, 309 of which were assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, precision was markedly lower overall: accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on the double-reviewed notes. LlaMA-2's errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of the MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of the MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and better performance compared to LlaMA-2.
The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
