MedPromptEval: A Comprehensive Framework for Systematic Evaluation of Clinical Question Answering Systems

Al Rahrooh, Anders O Garlid, Panayiotis Petousis, Arthur Fumnell, Alex A T Bui
{"title":"MedPromptEval:临床问答系统评估的综合框架。","authors":"Al Rahrooh, Anders O Garlid, Panayiotis Petousis, Arthur Fumnell, Alex A T Bui","doi":"10.3233/SHTI251540","DOIUrl":null,"url":null,"abstract":"<p><p>Clinical deployment of large language models (LLMs) faces critical challenges, including inconsistent prompt performance, variable model behavior, and a lack of standardized evaluation methodologies. We present MedPromptEval, a framework that systematically evaluates LLM-prompt combinations across clinically relevant dimensions. This framework automatically generates diverse prompt types, orchestrates response generation across multiple LLMs, and quantifies performance through multiple metrics measuring factual accuracy, semantic relevance, entailment consistency, and linguistic appropriateness. We demonstrate MedPromptEval's utility across publicly available clinical question answering (QA) datasets - MedQuAD, PubMedQA, and HealthCareMagic - in distinct evaluation modes: 1) model comparison using standardized prompts; 2) prompt strategy optimization using a controlled model; and 3) extensive assessment of prompt-model configurations. By enabling reproducible benchmarking of clinical LLM and QA applications, MedPromptEval provides insights for optimizing prompt engineering and model selection, advancing the reliable and effective integration of language models in health care settings.</p>","PeriodicalId":94357,"journal":{"name":"Studies in health technology and informatics","volume":"332 ","pages":"262-266"},"PeriodicalIF":0.0000,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MedPromptEval: A Comprehensive Framework for Systematic Evaluation of Clinical Question Answering Systems.\",\"authors\":\"Al Rahrooh, Anders O Garlid, Panayiotis Petousis, Arthur Fumnell, Alex A T Bui\",\"doi\":\"10.3233/SHTI251540\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Clinical deployment of large language models (LLMs) faces critical challenges, including inconsistent prompt performance, variable model behavior, and a lack of standardized evaluation methodologies. We present MedPromptEval, a framework that systematically evaluates LLM-prompt combinations across clinically relevant dimensions. This framework automatically generates diverse prompt types, orchestrates response generation across multiple LLMs, and quantifies performance through multiple metrics measuring factual accuracy, semantic relevance, entailment consistency, and linguistic appropriateness. We demonstrate MedPromptEval's utility across publicly available clinical question answering (QA) datasets - MedQuAD, PubMedQA, and HealthCareMagic - in distinct evaluation modes: 1) model comparison using standardized prompts; 2) prompt strategy optimization using a controlled model; and 3) extensive assessment of prompt-model configurations. 
By enabling reproducible benchmarking of clinical LLM and QA applications, MedPromptEval provides insights for optimizing prompt engineering and model selection, advancing the reliable and effective integration of language models in health care settings.</p>\",\"PeriodicalId\":94357,\"journal\":{\"name\":\"Studies in health technology and informatics\",\"volume\":\"332 \",\"pages\":\"262-266\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Studies in health technology and informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3233/SHTI251540\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Studies in health technology and informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3233/SHTI251540","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract


Clinical deployment of large language models (LLMs) faces critical challenges, including inconsistent prompt performance, variable model behavior, and a lack of standardized evaluation methodologies. We present MedPromptEval, a framework that systematically evaluates LLM-prompt combinations across clinically relevant dimensions. This framework automatically generates diverse prompt types, orchestrates response generation across multiple LLMs, and quantifies performance through multiple metrics measuring factual accuracy, semantic relevance, entailment consistency, and linguistic appropriateness. We demonstrate MedPromptEval's utility across publicly available clinical question answering (QA) datasets - MedQuAD, PubMedQA, and HealthCareMagic - in distinct evaluation modes: 1) model comparison using standardized prompts; 2) prompt strategy optimization using a controlled model; and 3) extensive assessment of prompt-model configurations. By enabling reproducible benchmarking of clinical LLM and QA applications, MedPromptEval provides insights for optimizing prompt engineering and model selection, advancing the reliable and effective integration of language models in health care settings.
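The abstract describes MedPromptEval's workflow only at a high level: generate prompt variants, run them across multiple LLMs, and score the responses on several metrics. The sketch below is a minimal illustration of how such a prompt-model grid evaluation could be wired together; it is not the paper's actual API. All names (`PROMPT_STRATEGIES`, `token_f1`, `evaluate`, the echo model) are hypothetical, and factual accuracy is approximated here with token-level F1 rather than the metrics used in the paper.

```python
"""Minimal sketch of a MedPromptEval-style evaluation loop (hypothetical interface)."""
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAItem:
    question: str
    reference_answer: str


# Hypothetical prompt strategies: each wraps a clinical question in a template.
PROMPT_STRATEGIES: Dict[str, Callable[[str], str]] = {
    "zero_shot": lambda q: f"Answer the clinical question concisely.\nQ: {q}\nA:",
    "chain_of_thought": lambda q: f"Think step by step, then answer.\nQ: {q}\nA:",
    "role_based": lambda q: f"You are a clinician advising a patient.\nQ: {q}\nA:",
}


def token_f1(prediction: str, reference: str) -> float:
    """Crude stand-in for factual accuracy: token-level F1 against the reference."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    common = pred & ref
    if not pred or not ref or not common:
        return 0.0
    precision = len(common) / len(pred)
    recall = len(common) / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    models: Dict[str, Callable[[str], str]],  # model name -> text-generation callable
    dataset: List[QAItem],
) -> List[dict]:
    """Score every prompt-model combination on every QA item."""
    results = []
    for model_name, generate in models.items():
        for strategy_name, build_prompt in PROMPT_STRATEGIES.items():
            for item in dataset:
                response = generate(build_prompt(item.question))
                results.append({
                    "model": model_name,
                    "prompt_strategy": strategy_name,
                    "question": item.question,
                    "factual_f1": token_f1(response, item.reference_answer),
                    # Semantic relevance, entailment consistency, and linguistic
                    # appropriateness would plug in here as additional metric functions.
                })
    return results


if __name__ == "__main__":
    # Toy run with an "echo" model; real use would wire in actual LLM clients.
    data = [QAItem("What is the first-line treatment for strep throat?",
                   "Penicillin or amoxicillin is the first-line treatment.")]
    echo_model = {"echo": lambda prompt: "Penicillin is typically used."}
    for row in evaluate(echo_model, data):
        print(row)
```

The same loop structure supports the three evaluation modes named in the abstract: fix the prompt strategy and vary the models, fix the model and vary the strategies, or sweep the full grid of prompt-model configurations.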
