Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study.

IF 2

JMIR AI Pub Date : 2024-04-29 DOI:10.2196/46875

Mohammad Hammoud, Shahd Douglas, Mohamad Darmach, Sara Alawneh, Swapnendu Sanyal, Youssef Kanbour

{"title":"Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study.","authors":"Mohammad Hammoud, Shahd Douglas, Mohamad Darmach, Sara Alawneh, Swapnendu Sanyal, Youssef Kanbour","doi":"10.2196/46875","DOIUrl":null,"url":null,"abstract":"Background: Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives, whereby patients are increasingly using them to identify the underlying causes of their symptoms. As such, it is essential to rigorously investigate and comprehensively report the diagnostic performance of symptom checkers using standard clinical and scientific approaches.Objective: This study aims to evaluate and report the accuracies of a few known and new symptom checkers using a standard and transparent methodology, which allows the scientific community to cross-validate and reproduce the reported results, a step much needed in health informatics.Methods: We propose a 4-stage experimentation methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average experience of 16.6 (SD 9.42) years. To measure accuracy, we used 7 standard metrics, including M1 as a measure of a symptom checker's or a physician's ability to return a vignette's main diagnosis at the top of their differential list, F1-score as a trade-off measure between recall and precision, and Normalized Discounted Cumulative Gain (NDCG) as a measure of a differential list's ranking quality, among others.Results: The diagnostic accuracies of the 6 tested symptom checkers vary significantly. For instance, the differences in the M1, F1-score, and NDCG results between the best-performing and worst-performing symptom checkers or ranges were 65.3%, 39.2%, and 74.2%, respectively. The same was observed among the participating human physicians, whereby the M1, F1-score, and NDCG ranges were 22.8%, 15.3%, and 21.3%, respectively. When compared against each other, physicians outperformed the best-performing symptom checker by an average of 1.2% using F1-score, whereas the best-performing symptom checker outperformed physicians by averages of 10.2% and 25.1% using M1 and NDCG, respectively.Conclusions: The performance variation between symptom checkers is substantial, suggesting that symptom checkers cannot be treated as a single entity. On a different note, the best-performing symptom checker was an artificial intelligence (AI)-based one, shedding light on the promise of AI in improving the diagnostic capabilities of symptom checkers, especially as AI keeps advancing exponentially.","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"3 ","pages":"e46875"},"PeriodicalIF":2.0000,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11091811/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/46875","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives, whereby patients are increasingly using them to identify the underlying causes of their symptoms. As such, it is essential to rigorously investigate and comprehensively report the diagnostic performance of symptom checkers using standard clinical and scientific approaches.

Objective: This study aims to evaluate and report the accuracies of a few known and new symptom checkers using a standard and transparent methodology, which allows the scientific community to cross-validate and reproduce the reported results, a step much needed in health informatics.

Methods: We propose a 4-stage experimentation methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average experience of 16.6 (SD 9.42) years. To measure accuracy, we used 7 standard metrics, including M1 as a measure of a symptom checker's or a physician's ability to return a vignette's main diagnosis at the top of their differential list, F₁-score as a trade-off measure between recall and precision, and Normalized Discounted Cumulative Gain (NDCG) as a measure of a differential list's ranking quality, among others.

Results: The diagnostic accuracies of the 6 tested symptom checkers vary significantly. For instance, the differences in the M1, F₁-score, and NDCG results between the best-performing and worst-performing symptom checkers or ranges were 65.3%, 39.2%, and 74.2%, respectively. The same was observed among the participating human physicians, whereby the M1, F₁-score, and NDCG ranges were 22.8%, 15.3%, and 21.3%, respectively. When compared against each other, physicians outperformed the best-performing symptom checker by an average of 1.2% using F₁-score, whereas the best-performing symptom checker outperformed physicians by averages of 10.2% and 25.1% using M1 and NDCG, respectively.

Conclusions: The performance variation between symptom checkers is substantial, suggesting that symptom checkers cannot be treated as a single entity. On a different note, the best-performing symptom checker was an artificial intelligence (AI)-based one, shedding light on the promise of AI in improving the diagnostic capabilities of symptom checkers, especially as AI keeps advancing exponentially.

查看原文本刊更多论文

评估症状检查器的诊断性能：临床小故事研究

背景：医疗自我诊断工具（或症状检查器）正在成为数字健康和我们日常生活中不可或缺的一部分，患者越来越多地使用这些工具来确定其症状的根本原因。因此，有必要采用标准的临床和科学方法对症状检查器的诊断性能进行严格研究和全面报告：本研究旨在使用标准、透明的方法评估和报告一些已知和新型症状检查器的准确性，使科学界能够交叉验证和复制所报告的结果，这是健康信息学亟需的一步：方法：我们提出了一种分 4 个阶段的实验方法，利用标准的临床小故事方法对 6 种症状检查器进行评估。为此，我们开发了 400 个小故事并进行了同行评议，每个小故事都得到了 7 位独立的、经验丰富的初级保健医生中至少 5 位的认可。为了建立一个参考框架并据此解释症状检查器的结果，我们进一步将表现最好的症状检查器与 3 位平均经验为 16.6（标清 9.42）年的全科医生进行了比较。为了衡量准确性，我们使用了 7 个标准指标，包括衡量症状检查器或医生将小节的主要诊断结果排在鉴别列表首位的能力的指标 M1、权衡召回率和精确度的指标 F1-score，以及衡量鉴别列表排序质量的指标归一化折现累积收益（NDCG）等：结果：6 个受测症状检查器的诊断准确率差异很大。例如，表现最好和最差的症状检查器或范围之间的 M1、F1 分数和 NDCG 结果差异分别为 65.3%、39.2% 和 74.2%。在参与的人类医生中也观察到了同样的情况，M1、F1 分数和 NDCG 范围分别为 22.8%、15.3% 和 21.3%。在相互比较时，使用 F1 分数，医生的表现比表现最好的症状检查器平均高出 1.2%，而使用 M1 和 NDCG，表现最好的症状检查器的表现分别比医生平均高出 10.2% 和 25.1%：结论：症状检查器之间的性能差异很大，这表明不能将症状检查器视为单一实体。从另一个角度看，表现最好的症状检查器是基于人工智能（AI）的，这说明人工智能有望提高症状检查器的诊断能力，尤其是在人工智能不断飞速发展的情况下。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

JMIR AI

自引率

0.00%

发文量