{"title":"医疗保健的可再生生成人工智能评估:临床医生在循环中的方法。","authors":"Leah Livingston, Amber Featherstone-Uwague, Amanda Barry, Kenneth Barretto, Tara Morey, Drahomira Herrmannova, Venkatesh Avula","doi":"10.1093/jamiaopen/ooaf054","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>To develop and apply a reproducible methodology for evaluating generative artificial intelligence (AI) powered systems in health care, addressing the gap between theoretical evaluation frameworks and practical implementation guidance.</p><p><strong>Materials and methods: </strong>A 5-dimension evaluation framework was developed to assess query comprehension and response helpfulness, correctness, completeness, and potential clinical harm. The framework was applied to evaluate ClinicalKey AI using queries drawn from user logs, a benchmark dataset, and subject matter expert curated queries. Forty-one board-certified physicians and pharmacists were recruited to independently evaluate query-response pairs. An agreement protocol using the mode and modified Delphi method resolved disagreements in evaluation scores.</p><p><strong>Results: </strong>Of 633 queries, 614 (96.99%) produced evaluable responses, with subject matter experts completing evaluations of 426 query-response pairs. Results demonstrated high rates of response correctness (95.5%) and query comprehension (98.6%), with 94.4% of responses rated as helpful. Two responses (0.47%) received scores indicating potential clinical harm. Pairwise consensus occurred in 60.6% of evaluations, with remaining cases requiring third tie-breaker review.</p><p><strong>Discussion: </strong>The framework demonstrated effectiveness in quantifying performance through comprehensive evaluation dimensions and structured scoring resolution methods. Key strengths included representative query sampling, standardized rating scales, and robust subject matter expert agreement protocols. Challenges emerged in managing subjective assessments of open-ended responses and achieving consensus on potential harm classification.</p><p><strong>Conclusion: </strong>This framework offers a reproducible methodology for evaluating health-care generative AI systems, establishing foundational processes that can inform future efforts while supporting the implementation of generative AI applications in clinical settings.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 3","pages":"ooaf054"},"PeriodicalIF":2.5000,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12169418/pdf/","citationCount":"0","resultStr":"{\"title\":\"Reproducible generative artificial intelligence evaluation for health care: a clinician-in-the-loop approach.\",\"authors\":\"Leah Livingston, Amber Featherstone-Uwague, Amanda Barry, Kenneth Barretto, Tara Morey, Drahomira Herrmannova, Venkatesh Avula\",\"doi\":\"10.1093/jamiaopen/ooaf054\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objectives: </strong>To develop and apply a reproducible methodology for evaluating generative artificial intelligence (AI) powered systems in health care, addressing the gap between theoretical evaluation frameworks and practical implementation guidance.</p><p><strong>Materials and methods: </strong>A 5-dimension evaluation framework was developed to assess query comprehension and response helpfulness, correctness, completeness, and potential clinical harm. 
The framework was applied to evaluate ClinicalKey AI using queries drawn from user logs, a benchmark dataset, and subject matter expert curated queries. Forty-one board-certified physicians and pharmacists were recruited to independently evaluate query-response pairs. An agreement protocol using the mode and modified Delphi method resolved disagreements in evaluation scores.</p><p><strong>Results: </strong>Of 633 queries, 614 (96.99%) produced evaluable responses, with subject matter experts completing evaluations of 426 query-response pairs. Results demonstrated high rates of response correctness (95.5%) and query comprehension (98.6%), with 94.4% of responses rated as helpful. Two responses (0.47%) received scores indicating potential clinical harm. Pairwise consensus occurred in 60.6% of evaluations, with remaining cases requiring third tie-breaker review.</p><p><strong>Discussion: </strong>The framework demonstrated effectiveness in quantifying performance through comprehensive evaluation dimensions and structured scoring resolution methods. Key strengths included representative query sampling, standardized rating scales, and robust subject matter expert agreement protocols. Challenges emerged in managing subjective assessments of open-ended responses and achieving consensus on potential harm classification.</p><p><strong>Conclusion: </strong>This framework offers a reproducible methodology for evaluating health-care generative AI systems, establishing foundational processes that can inform future efforts while supporting the implementation of generative AI applications in clinical settings.</p>\",\"PeriodicalId\":36278,\"journal\":{\"name\":\"JAMIA Open\",\"volume\":\"8 3\",\"pages\":\"ooaf054\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2025-06-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12169418/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JAMIA Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/jamiaopen/ooaf054\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/6/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JAMIA Open","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/jamiaopen/ooaf054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/6/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Reproducible generative artificial intelligence evaluation for health care: a clinician-in-the-loop approach.
Objectives: To develop and apply a reproducible methodology for evaluating generative artificial intelligence (AI)-powered systems in health care, addressing the gap between theoretical evaluation frameworks and practical implementation guidance.
Materials and methods: A five-dimension evaluation framework was developed to assess query comprehension and the helpfulness, correctness, completeness, and potential clinical harm of responses. The framework was applied to evaluate ClinicalKey AI using queries drawn from user logs, a benchmark dataset, and queries curated by subject matter experts. Forty-one board-certified physicians and pharmacists were recruited to independently evaluate query-response pairs. An agreement protocol, based on the mode of ratings and a modified Delphi method, resolved disagreements in evaluation scores.
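To make the mode-based agreement protocol concrete, below is a minimal Python sketch of how scores for one dimension might be resolved: pairwise consensus when two independent raters agree, with a third tie-breaking rating added on disagreement (a modified-Delphi-style escalation). The dimension names, ordinal scale, and function name are illustrative assumptions; the paper does not publish its scoring code.

```python
from collections import Counter
from typing import Optional

# Illustrative labels for the five evaluation dimensions described above;
# the study's actual rating scales are not specified in the abstract.
DIMENSIONS = ("comprehension", "helpfulness", "correctness", "completeness", "harm")

def resolve_score(ratings: list[int], tie_breaker: Optional[int] = None) -> Optional[int]:
    """Resolve one dimension's final score as the mode of independent ratings.

    With two raters, pairwise consensus means both gave the same score.
    On disagreement, a third tie-breaking rating is solicited and the mode
    of all three ratings is taken (assumed escalation logic).
    """
    counts = Counter(ratings)
    score, count = counts.most_common(1)[0]
    if count > len(ratings) / 2:
        return score            # clear majority: pairwise consensus
    if tie_breaker is None:
        return None             # no consensus: escalate to a third reviewer
    counts[tie_breaker] += 1
    return counts.most_common(1)[0][0]

# Example: raters disagree (3 vs 4); a third reviewer's rating of 4 breaks the tie.
assert resolve_score([3, 3]) == 3               # pairwise consensus
assert resolve_score([3, 4]) is None            # disagreement -> escalate
assert resolve_score([3, 4], tie_breaker=4) == 4
```

If the third rating matched neither original score, the mode would still be ambiguous; in a Delphi-style process such cases would presumably go to further discussion, which this sketch does not model.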
Results: Of 633 queries, 614 (96.99%) produced evaluable responses, with subject matter experts completing evaluations of 426 query-response pairs. Results demonstrated high rates of response correctness (95.5%) and query comprehension (98.6%), with 94.4% of responses rated as helpful. Two responses (0.47%) received scores indicating potential clinical harm. Pairwise consensus occurred in 60.6% of evaluations, with the remaining cases requiring a third, tie-breaking review.
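As a quick sanity check, the headline rates can be recomputed from the stated counts and denominators (an illustrative calculation, not taken from the study's code):

```python
# Recompute reported rates from the counts given in the Results section.
evaluable_rate = 614 / 633   # 0.96998... -> reported as 96.99%
harm_rate = 2 / 426          # 0.00469... -> reported as 0.47%
print(f"evaluable: {evaluable_rate:.2%}, potential harm: {harm_rate:.2%}")
```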
Discussion: The framework demonstrated effectiveness in quantifying performance through comprehensive evaluation dimensions and structured scoring resolution methods. Key strengths included representative query sampling, standardized rating scales, and robust subject matter expert agreement protocols. Challenges emerged in managing subjective assessments of open-ended responses and achieving consensus on potential harm classification.
Conclusion: This framework offers a reproducible methodology for evaluating generative AI systems in health care, establishing foundational processes that can inform future efforts while supporting the implementation of generative AI applications in clinical settings.