Evaluation Metrics for Health Chatbots: A Delphi Study.

IF 1.3 | Medicine (CAS Tier 4) | JCR Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS
Methods of Information in Medicine | Pub Date: 2021-12-01 | Epub Date: 2021-10-31 | DOI: 10.1055/s-0041-1736664
Kerstin Denecke, Alaa Abd-Alrazaq, Mowafa Househ, Jim Warren
{"title":"Evaluation Metrics for Health Chatbots: A Delphi Study.","authors":"Kerstin Denecke,&nbsp;Alaa Abd-Alrazaq,&nbsp;Mowafa Househ,&nbsp;Jim Warren","doi":"10.1055/s-0041-1736664","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>In recent years, an increasing number of health chatbots has been published in app stores and described in research literature. Given the sensitive data they are processing and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of those systems are reported inconsistently and without using a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent comparisons of systems, and this may hamper acceptability since their reliability is unclear.</p><p><strong>Objectives: </strong>The objective of this paper is to make an important step toward developing a health-specific chatbot evaluation framework by finding consensus on relevant metrics.</p><p><strong>Methods: </strong>We used an adapted Delphi study design to verify and select potential metrics that we retrieved initially from a scoping review. We invited researchers, health professionals, and health informaticians to score each metric for inclusion in the final evaluation framework, over three survey rounds. We distinguished metrics scored relevant with high, moderate, and low consensus. The initial set of metrics comprised 26 metrics (categorized as global metrics, metrics related to response generation, response understanding and aesthetics).</p><p><strong>Results: </strong>Twenty-eight experts joined the first round and 22 (75%) persisted to the third round. Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus. The core set for our framework comprises mainly global metrics (e.g., ease of use, security content accuracy), metrics related to response generation (e.g., appropriateness of responses), and related to response understanding. Metrics on aesthetics (font type and size, color) are less well agreed upon-only moderate or low consensus was achieved for those metrics.</p><p><strong>Conclusion: </strong>The results indicate that experts largely agree on metrics and that the consensus set is broad. This implies that health chatbot evaluation must be multifaceted to ensure acceptability.</p>","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":"60 5-06","pages":"171-179"},"PeriodicalIF":1.3000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Methods of Information in Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/s-0041-1736664","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2021/10/31 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 9

Abstract

Background: In recent years, an increasing number of health chatbots have been published in app stores and described in the research literature. Given the sensitive data they process and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of these systems are reported inconsistently and without a standardized set of evaluation metrics. The lack of standards in health chatbot evaluation prevents comparison of systems, and this may hamper acceptability since their reliability remains unclear.

Objectives: The objective of this paper is to take an important step toward developing a health-specific chatbot evaluation framework by finding consensus on relevant metrics.

Methods: We used an adapted Delphi study design to verify and select potential metrics that we initially retrieved from a scoping review. Over three survey rounds, we invited researchers, health professionals, and health informaticians to score each metric for inclusion in the final evaluation framework. We distinguished metrics rated relevant with high, moderate, and low consensus. The initial set comprised 26 metrics (categorized as global metrics and as metrics related to response generation, response understanding, and aesthetics).
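To illustrate the consensus-classification step, the following Python sketch groups per-expert relevance ratings for one metric into high, moderate, and low consensus. The rating scale, the agreement thresholds, and the function name are assumptions for illustration only; the abstract does not state the exact cut-offs used in the study.

```python
# A minimal sketch of Delphi-style consensus grading, assuming a 1-5
# relevance scale and hypothetical agreement cut-offs (70% / 50%);
# the study's actual thresholds are not given in the abstract.
def consensus_level(ratings, relevant_from=4):
    """Classify one metric's consensus from per-expert Likert ratings."""
    share = sum(r >= relevant_from for r in ratings) / len(ratings)
    if share >= 0.70:
        return "high"
    if share >= 0.50:
        return "moderate"
    return "low"

# Example: ratings for a hypothetical metric from one survey round
# with 22 experts; 19 of 22 (86%) rate it relevant.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 2, 5, 4, 4, 5, 4, 3, 4, 5, 4, 4, 5, 4]
print(consensus_level(ratings))  # -> "high"
```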

Results: Twenty-eight experts joined the first round and 22 (75%) persisted to the third round. Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus. The core set of our framework comprises mainly global metrics (e.g., ease of use, security, content accuracy), metrics related to response generation (e.g., appropriateness of responses), and metrics related to response understanding. Metrics on aesthetics (font type and size, color) are less well agreed upon; only moderate or low consensus was achieved for those metrics.

Conclusion: The results indicate that experts largely agree on metrics and that the consensus set is broad. This implies that health chatbot evaluation must be multifaceted to ensure acceptability.

Source Journal

Methods of Information in Medicine (Medicine - Computer Science: Information Systems)
CiteScore: 3.70
Self-citation rate: 11.80%
Articles per year: 33
Review turnaround: 6-12 weeks
About the journal: Good medicine and good healthcare demand good information. Since the journal's founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing, and analyzing data, information, and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of an issue.