MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan
{"title":"MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications","authors":"Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan","doi":"arxiv-2409.07314","DOIUrl":null,"url":null,"abstract":"The rapid development of Large Language Models (LLMs) for healthcare\napplications has spurred calls for holistic evaluation beyond frequently-cited\nbenchmarks like USMLE, to better reflect real-world performance. While\nreal-world assessments are valuable indicators of utility, they often lag\nbehind the pace of LLM evolution, likely rendering findings obsolete upon\ndeployment. This temporal disconnect necessitates a comprehensive upfront\nevaluation that can guide model selection for specific clinical applications.\nWe introduce MEDIC, a framework assessing LLMs across five critical dimensions\nof clinical competence: medical reasoning, ethics and bias, data and language\nunderstanding, in-context learning, and clinical safety. MEDIC features a novel\ncross-examination framework quantifying LLM performance across areas like\ncoverage and hallucination detection, without requiring reference outputs. We\napply MEDIC to evaluate LLMs on medical question-answering, safety,\nsummarization, note generation, and other tasks. Our results show performance\ndisparities across model sizes, baseline vs medically finetuned models, and\nhave implications on model selection for applications requiring specific model\nstrengths, such as low hallucination or lower cost of inference. MEDIC's\nmultifaceted evaluation reveals these performance trade-offs, bridging the gap\nbetween theoretical capabilities and practical implementation in healthcare\nsettings, ensuring that the most promising models are identified and adapted\nfor diverse healthcare applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07314","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically fine-tuned models, with implications for model selection in applications that require specific model strengths, such as low hallucination rates or lower inference cost. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, and ensuring that the most promising models are identified and adapted for diverse healthcare applications.
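The reference-free cross-examination idea can be made concrete with a small sketch. The snippet below is an illustrative assumption, not the paper's released implementation: `ask_judge`, `generate_questions`, and `cross_examine` are hypothetical names, and the prompts and metric definitions are simplified stand-ins for whatever MEDIC actually uses. The sketch generates factual questions from the source note and from the model's output, then checks each question set against the other text: source questions left unanswered by the output indicate coverage gaps, and output questions not grounded in the source serve as a hallucination proxy.

```python
# Minimal, assumption-laden sketch of a reference-free cross-examination check.
# `ask_judge` is a hypothetical stand-in for any judge-LLM call (e.g., an API
# client) that takes a prompt string and returns the model's text completion.
from typing import Callable, List


def generate_questions(text: str, ask_judge: Callable[[str], str], n: int = 5) -> List[str]:
    """Ask the judge model to write factual questions answerable from `text`."""
    prompt = (
        f"Write {n} short factual questions that can be answered solely "
        f"from the following clinical text, one question per line:\n\n{text}"
    )
    return [q.strip() for q in ask_judge(prompt).splitlines() if q.strip()]


def fraction_answerable(questions: List[str], context: str, ask_judge: Callable[[str], str]) -> float:
    """Fraction of questions the judge deems answerable from `context` alone."""
    answerable = 0
    for question in questions:
        verdict = ask_judge(
            f"Context:\n{context}\n\nQuestion: {question}\n"
            "Answer strictly YES or NO: can the question be answered from the context alone?"
        )
        answerable += verdict.strip().upper().startswith("YES")
    return answerable / max(len(questions), 1)


def cross_examine(source_note: str, model_output: str, ask_judge: Callable[[str], str]) -> dict:
    """Score coverage and a hallucination proxy without a gold reference."""
    source_questions = generate_questions(source_note, ask_judge)
    output_questions = generate_questions(model_output, ask_judge)
    coverage = fraction_answerable(source_questions, model_output, ask_judge)
    grounded = fraction_answerable(output_questions, source_note, ask_judge)
    return {"coverage": coverage, "hallucination": 1.0 - grounded}
```

In use, one would pass a clinical note and the model's generated summary together with an LLM client wrapped as `ask_judge`; averaging the resulting scores over a task's documents gives reference-free coverage and hallucination estimates that can be compared across models, in the spirit of the evaluation described in the abstract.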