Evaluation Metrics for Health Chatbots: A Delphi Study
Kerstin Denecke, Alaa Abd-Alrazaq, Mowafa Househ, Jim Warren
Methods of Information in Medicine 2021;60(5-06):171-179. doi:10.1055/s-0041-1736664
{"title":"Evaluation Metrics for Health Chatbots: A Delphi Study.","authors":"Kerstin Denecke, Alaa Abd-Alrazaq, Mowafa Househ, Jim Warren","doi":"10.1055/s-0041-1736664","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>In recent years, an increasing number of health chatbots has been published in app stores and described in research literature. Given the sensitive data they are processing and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of those systems are reported inconsistently and without using a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent comparisons of systems, and this may hamper acceptability since their reliability is unclear.</p><p><strong>Objectives: </strong>The objective of this paper is to make an important step toward developing a health-specific chatbot evaluation framework by finding consensus on relevant metrics.</p><p><strong>Methods: </strong>We used an adapted Delphi study design to verify and select potential metrics that we retrieved initially from a scoping review. We invited researchers, health professionals, and health informaticians to score each metric for inclusion in the final evaluation framework, over three survey rounds. We distinguished metrics scored relevant with high, moderate, and low consensus. The initial set of metrics comprised 26 metrics (categorized as global metrics, metrics related to response generation, response understanding and aesthetics).</p><p><strong>Results: </strong>Twenty-eight experts joined the first round and 22 (75%) persisted to the third round. Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus. The core set for our framework comprises mainly global metrics (e.g., ease of use, security content accuracy), metrics related to response generation (e.g., appropriateness of responses), and related to response understanding. Metrics on aesthetics (font type and size, color) are less well agreed upon-only moderate or low consensus was achieved for those metrics.</p><p><strong>Conclusion: </strong>The results indicate that experts largely agree on metrics and that the consensus set is broad. This implies that health chatbot evaluation must be multifaceted to ensure acceptability.</p>","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":"60 5-06","pages":"171-179"},"PeriodicalIF":1.3000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Methods of Information in Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/s-0041-1736664","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2021/10/31 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 9
Abstract
Background: In recent years, an increasing number of health chatbots have been published in app stores and described in the research literature. Given the sensitive data they process and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of these systems are reported inconsistently and without a standardized set of evaluation metrics. The lack of standards for health chatbot evaluation prevents comparisons between systems, which may hamper acceptability because their reliability remains unclear.
Objectives: The objective of this paper is to take an important step toward a health-specific chatbot evaluation framework by finding consensus on relevant metrics.
Methods: We used an adapted Delphi study design to verify and select potential metrics that we initially retrieved from a scoping review. Over three survey rounds, we invited researchers, health professionals, and health informaticians to score each metric for inclusion in the final evaluation framework. We distinguished metrics rated as relevant with high, moderate, or low consensus. The initial set comprised 26 metrics, categorized as global metrics and metrics related to response generation, response understanding, and aesthetics.
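To illustrate the consensus classification step described above, here is a minimal Python sketch. The rating scale, the relevance cut-off, and the consensus thresholds (75% of experts for high consensus, 50% for moderate) are hypothetical assumptions for illustration only; the abstract does not define the study's actual cut-offs.

    # Minimal sketch of Delphi consensus classification.
    # All thresholds below are assumptions, not the study's actual cut-offs.
    from typing import Dict, List

    # Assumption: a rating >= 4 on a 5-point scale counts as "relevant".
    RELEVANT_RATING = 4

    def consensus_level(ratings: List[int],
                        high: float = 0.75,      # hypothetical high-consensus threshold
                        moderate: float = 0.50   # hypothetical moderate-consensus threshold
                        ) -> str:
        """Classify consensus as 'high', 'moderate', or 'low' from expert ratings."""
        share = sum(r >= RELEVANT_RATING for r in ratings) / len(ratings)
        if share >= high:
            return "high"
        if share >= moderate:
            return "moderate"
        return "low"

    # Illustrative (invented) third-round scores from 22 experts per metric.
    round3_scores: Dict[str, List[int]] = {
        "ease of use": [5, 5, 4, 5, 4, 4, 5, 4, 5, 5, 4,
                        4, 5, 4, 4, 5, 5, 4, 4, 5, 4, 5],
        "font type and size": [3, 2, 4, 3, 3, 4, 2, 3, 3, 4, 3,
                               2, 3, 4, 3, 3, 2, 3, 4, 3, 3, 2],
    }

    for metric, scores in round3_scores.items():
        print(f"{metric}: {consensus_level(scores)} consensus")

Under these assumed thresholds, a metric rated relevant by nearly all experts (e.g., ease of use) classifies as high consensus, while an aesthetics metric with scattered ratings classifies as low, mirroring the pattern reported in the results.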
Results: Twenty-eight experts joined the first round and 22 (79%) persisted to the third round. Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus. The core set of our framework comprises mainly global metrics (e.g., ease of use, security, content accuracy), metrics related to response generation (e.g., appropriateness of responses), and metrics related to response understanding. Metrics on aesthetics (font type and size, color) are less well agreed upon; only moderate or low consensus was achieved for those metrics.
Conclusion: The results indicate that experts largely agree on metrics and that the consensus set is broad. This implies that health chatbot evaluation must be multifaceted to ensure acceptability.
About the Journal
Good medicine and good healthcare demand good information. Since the journal's founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing, and analyzing data, information, and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of an issue.