Advancing Clinical Chatbot Validation Using AI-Powered Evaluation With a New 3-Bot Evaluation System: Instrument Validation Study.

JMIR Nursing · Pub Date: 2025-02-27 · DOI: 10.2196/63058
Seungheon Choo, Suyoung Yoo, Kumiko Endo, Bao Truong, Meong Hi Son
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11884306/pdf/

Abstract

Background: The health care sector faces a projected shortfall of 10 million workers by 2030. Artificial intelligence (AI) automation in areas such as patient education and initial therapy screening presents a strategic response to mitigate this shortage and reallocate medical staff to higher-priority tasks. However, current methods of evaluating early-stage health care AI chatbots are severely limited by safety concerns and by the time and effort that evaluation requires.

Objective: This study introduces a novel 3-bot method for efficiently testing and validating early-stage AI health care provider chatbots. To extensively test AI provider chatbots without involving real patients or researchers, various AI patient bots and an evaluator bot were developed.

Methods: Provider bots interacted with AI patient bots embodying frustrated, anxious, or depressed personas. An evaluator bot reviewed interaction transcripts based on specific criteria. Human experts then reviewed each interaction transcript, and the evaluator bot's results were compared to human evaluation results to ensure accuracy.
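The 3-bot loop described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the personas, reply functions, and criteria below are placeholder stand-ins for what would be LLM calls in the real system.

```python
# Sketch of the 3-bot evaluation loop: a provider bot converses with
# persona-conditioned patient bots, and an evaluator bot scores the
# resulting transcript against named criteria. All prompts, replies,
# and criteria here are hypothetical stand-ins for real LLM calls.

PERSONAS = ("frustrated", "anxious", "depressed")

def patient_reply(persona, transcript):
    # Stand-in for a persona-conditioned patient bot (an LLM call in practice).
    return f"As a {persona} patient, I'm not sure my concerns are being heard."

def provider_reply(transcript):
    # Stand-in for the provider chatbot under test.
    return "I hear you. Could you tell me more about what you're experiencing?"

def run_dialogue(persona, turns=3):
    # Alternate patient and provider messages for a fixed number of turns.
    transcript = []
    for _ in range(turns):
        transcript.append(("patient", patient_reply(persona, transcript)))
        transcript.append(("provider", provider_reply(transcript)))
    return transcript

def evaluate(transcript, criteria):
    # Stand-in for the evaluator bot: returns pass/fail per criterion
    # based only on the provider's side of the transcript.
    provider_text = " ".join(m for role, m in transcript if role == "provider").lower()
    checks = {
        "acknowledgment": "i hear you" in provider_text,
        "open_question": "tell me more" in provider_text,
    }
    return {c: checks.get(c, False) for c in criteria}

results = {p: evaluate(run_dialogue(p), ["acknowledgment", "open_question"])
           for p in PERSONAS}
```

Because no human is in the loop, many persona-by-criterion combinations can be exercised quickly and the same transcripts handed to human experts afterward for comparison.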

Results: The patient-education bot's evaluations by the AI evaluator and the human evaluator were nearly identical, with minimal variance, limiting the opportunity for further analysis. The screening bot's evaluations also yielded similar results between the AI evaluator and human evaluator. Statistical analysis confirmed the reliability and accuracy of the AI evaluations.
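The abstract reports that statistical analysis confirmed the reliability of the AI evaluations but does not name the statistic used. One standard chance-corrected measure of agreement between two raters, such as an AI evaluator and a human evaluator assigning pass/fail labels, is Cohen's kappa; the sketch below uses made-up ratings, not the study's data.

```python
# Cohen's kappa: chance-corrected agreement between two raters who
# assign categorical labels (here, pass/fail per evaluation criterion).

def cohens_kappa(a, b):
    """Kappa for two equal-length lists of categorical labels."""
    assert len(a) == len(b) and a, "need two equal-length, nonempty ratings"
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items both raters labeled the same.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent raters with these marginals.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of six transcript criteria by each evaluator.
ai_ratings    = ["pass", "pass", "fail", "pass", "fail", "pass"]
human_ratings = ["pass", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(ai_ratings, human_ratings)
```

A kappa near 1 indicates near-identical ratings, which is the pattern the study reports; with very few criteria, though, high agreement is easier to reach by chance, which motivates the authors' call for a larger criterion set.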

Conclusions: The innovative evaluation method provides a safe, adaptable, and effective means to test and refine early versions of health care provider chatbots without risking patient safety or investing excessive researcher time and effort. Our patient-education evaluator bots could have benefited from a larger set of evaluation criteria: the AI and human evaluators produced extremely similar results, which may reflect the small number of criteria used. The amount of prompting we could supply to each bot was limited by the practical consideration that response time grows with prompt length. In the future, techniques such as retrieval-augmented generation will allow the system to receive more information and evaluate chatbots more specifically and accurately. This evaluation method will allow rapid testing and validation of health care chatbots that automate basic medical tasks, freeing providers to address more complex ones.
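Retrieval-augmented generation, mentioned above as future work, sidesteps the prompt-length limit by fetching only the guideline passages relevant to a given transcript instead of packing everything into one prompt. The sketch below is an illustrative minimal version using word-overlap retrieval; the guideline snippets and prompt format are hypothetical, and a real system would use embedding-based retrieval and an LLM evaluator.

```python
# Minimal RAG-style sketch: rank guideline snippets by word overlap with
# the transcript and prepend only the top matches to the evaluator prompt,
# keeping the prompt short. Snippets and format are illustrative only.

GUIDELINES = [
    "Providers should acknowledge patient emotions before giving advice.",
    "Screening questions must cover duration and severity of symptoms.",
    "Patient education should avoid jargon and confirm understanding.",
]

def retrieve(query, docs, k=2):
    # Score each document by shared lowercase words with the query.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_evaluator_prompt(transcript):
    # Only the k most relevant guidelines reach the evaluator's context.
    context = "\n".join(retrieve(transcript, GUIDELINES))
    return (f"Guidelines:\n{context}\n\n"
            f"Transcript:\n{transcript}\n\n"
            "Score each criterion as pass or fail.")
```

Because retrieval narrows the context per transcript, the evaluator can draw on a large guideline corpus without the response-time penalty of ever-longer prompts.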

Source journal: JMIR Nursing (CiteScore 5.20; self-citation rate 0.00%; review time 16 weeks).