Al Rahrooh, Anders O Garlid, Panayiotis Petousis, Arthur Fumnell, Alex A T Bui
{"title":"MedPromptEval:临床问答系统评估的综合框架。","authors":"Al Rahrooh, Anders O Garlid, Panayiotis Petousis, Arthur Fumnell, Alex A T Bui","doi":"10.3233/SHTI251540","DOIUrl":null,"url":null,"abstract":"<p><p>Clinical deployment of large language models (LLMs) faces critical challenges, including inconsistent prompt performance, variable model behavior, and a lack of standardized evaluation methodologies. We present MedPromptEval, a framework that systematically evaluates LLM-prompt combinations across clinically relevant dimensions. This framework automatically generates diverse prompt types, orchestrates response generation across multiple LLMs, and quantifies performance through multiple metrics measuring factual accuracy, semantic relevance, entailment consistency, and linguistic appropriateness. We demonstrate MedPromptEval's utility across publicly available clinical question answering (QA) datasets - MedQuAD, PubMedQA, and HealthCareMagic - in distinct evaluation modes: 1) model comparison using standardized prompts; 2) prompt strategy optimization using a controlled model; and 3) extensive assessment of prompt-model configurations. By enabling reproducible benchmarking of clinical LLM and QA applications, MedPromptEval provides insights for optimizing prompt engineering and model selection, advancing the reliable and effective integration of language models in health care settings.</p>","PeriodicalId":94357,"journal":{"name":"Studies in health technology and informatics","volume":"332 ","pages":"262-266"},"PeriodicalIF":0.0000,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MedPromptEval: A Comprehensive Framework for Systematic Evaluation of Clinical Question Answering Systems.\",\"authors\":\"Al Rahrooh, Anders O Garlid, Panayiotis Petousis, Arthur Fumnell, Alex A T Bui\",\"doi\":\"10.3233/SHTI251540\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Clinical deployment of large language models (LLMs) faces critical challenges, including inconsistent prompt performance, variable model behavior, and a lack of standardized evaluation methodologies. We present MedPromptEval, a framework that systematically evaluates LLM-prompt combinations across clinically relevant dimensions. This framework automatically generates diverse prompt types, orchestrates response generation across multiple LLMs, and quantifies performance through multiple metrics measuring factual accuracy, semantic relevance, entailment consistency, and linguistic appropriateness. We demonstrate MedPromptEval's utility across publicly available clinical question answering (QA) datasets - MedQuAD, PubMedQA, and HealthCareMagic - in distinct evaluation modes: 1) model comparison using standardized prompts; 2) prompt strategy optimization using a controlled model; and 3) extensive assessment of prompt-model configurations. 
By enabling reproducible benchmarking of clinical LLM and QA applications, MedPromptEval provides insights for optimizing prompt engineering and model selection, advancing the reliable and effective integration of language models in health care settings.</p>\",\"PeriodicalId\":94357,\"journal\":{\"name\":\"Studies in health technology and informatics\",\"volume\":\"332 \",\"pages\":\"262-266\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Studies in health technology and informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3233/SHTI251540\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Studies in health technology and informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3233/SHTI251540","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MedPromptEval: A Comprehensive Framework for Systematic Evaluation of Clinical Question Answering Systems.
Clinical deployment of large language models (LLMs) faces critical challenges, including inconsistent prompt performance, variable model behavior, and a lack of standardized evaluation methodologies. We present MedPromptEval, a framework that systematically evaluates LLM-prompt combinations across clinically relevant dimensions. The framework automatically generates diverse prompt types, orchestrates response generation across multiple LLMs, and quantifies performance with metrics for factual accuracy, semantic relevance, entailment consistency, and linguistic appropriateness. We demonstrate MedPromptEval's utility across publicly available clinical question answering (QA) datasets (MedQuAD, PubMedQA, and HealthCareMagic) in three distinct evaluation modes: 1) model comparison using standardized prompts; 2) prompt strategy optimization using a controlled model; and 3) extensive assessment of prompt-model configurations. By enabling reproducible benchmarking of LLM-based clinical QA applications, MedPromptEval provides insights for optimizing prompt engineering and model selection, advancing the reliable and effective integration of language models in health care settings.
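The workflow described above (generating prompt variants, orchestrating responses across models, and scoring them with multiple metrics) can be illustrated with a minimal sketch. Everything in the sketch is an assumption for illustration only: the function and class names, the prompt templates, and the use of sentence-embedding cosine similarity as a semantic-relevance proxy are not the authors' published API or metric implementation.

```python
# Hypothetical sketch of a MedPromptEval-style evaluation loop.
# Names, templates, and metric choices are illustrative assumptions,
# not the framework's actual implementation.

from dataclasses import dataclass
from itertools import product
from typing import Dict, List

from sentence_transformers import SentenceTransformer, util  # semantic-relevance proxy


@dataclass
class EvalRecord:
    model_name: str
    prompt_type: str
    question: str
    reference: str
    response: str
    scores: Dict[str, float]


def generate_response(model_name: str, prompt: str) -> str:
    # Placeholder: plug in your own LLM client here (assumption).
    raise NotImplementedError("connect an LLM client")


def build_prompt(prompt_type: str, question: str) -> str:
    # Illustrative prompt templates; the paper's actual prompt taxonomy may differ.
    templates = {
        "direct": "Answer the clinical question: {q}",
        "chain_of_thought": "Think step by step, then answer: {q}",
        "persona": "You are a clinician. Answer concisely: {q}",
    }
    return templates[prompt_type].format(q=question)


_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_relevance(response: str, reference: str) -> float:
    # Cosine similarity between response and reference embeddings.
    emb = _embedder.encode([response, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))


def evaluate(models: List[str], prompt_types: List[str],
             dataset: List[Dict[str, str]]) -> List[EvalRecord]:
    # Sweep every model x prompt-type combination over the QA dataset
    # and record per-response metric scores.
    records = []
    for model_name, prompt_type in product(models, prompt_types):
        for item in dataset:  # each item: {"question": ..., "answer": ...}
            prompt = build_prompt(prompt_type, item["question"])
            response = generate_response(model_name, prompt)
            scores = {"semantic_relevance": semantic_relevance(response, item["answer"])}
            records.append(EvalRecord(model_name, prompt_type, item["question"],
                                      item["answer"], response, scores))
    return records
```

In the same spirit, the other dimensions named in the abstract could be added as further entries in the scores dictionary, for example an NLI classifier for entailment consistency or a readability measure for linguistic appropriateness; the specific models and thresholds would again be implementation choices rather than anything prescribed here.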