Facilitating Trust Calibration in Artificial Intelligence-Driven Diagnostic Decision Support Systems for Determining Physicians' Diagnostic Accuracy: Quasi-Experimental Study.

IF 2.0 | Q3 | Health Care Sciences & Services
Tetsu Sakamoto, Yukinori Harada, Taro Shimizu
{"title":"Facilitating Trust Calibration in Artificial Intelligence-Driven Diagnostic Decision Support Systems for Determining Physicians' Diagnostic Accuracy: Quasi-Experimental Study.","authors":"Tetsu Sakamoto, Yukinori Harada, Taro Shimizu","doi":"10.2196/58666","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Diagnostic errors are significant problems in medical care. Despite the usefulness of artificial intelligence (AI)-based diagnostic decision support systems, the overreliance of physicians on AI-generated diagnoses may lead to diagnostic errors.</p><p><strong>Objective: </strong>We investigated the safe use of AI-based diagnostic decision support systems with trust calibration by adjusting trust levels to match the actual reliability of AI.</p><p><strong>Methods: </strong>A quasi-experimental study was conducted at Dokkyo Medical University, Japan, with physicians allocated (1:1) to the intervention and control groups. A total of 20 clinical cases were created based on the medical histories recorded by an AI-driven automated medical history-taking system from actual patients who visited a community-based hospital in Japan. The participants reviewed the medical histories of 20 clinical cases generated by an AI-driven automated medical history-taking system with an AI-generated list of 10 differential diagnoses and provided 1 to 3 possible diagnoses. Physicians were asked whether the final diagnosis was in the AI-generated list of 10 differential diagnoses in the intervention group, which served as the trust calibration. We analyzed the diagnostic accuracy of physicians and the correctness of the trust calibration in the intervention group. We also investigated the relationship between the accuracy of the trust calibration and the diagnostic accuracy of physicians, and the physicians' confidence level regarding the use of AI.</p><p><strong>Results: </strong>Among the 20 physicians assigned to the intervention (n=10) and control (n=10) groups, the mean age was 30.9 (SD 3.9) years and 31.7 (SD 4.2) years, the proportion of men was 80% and 60%, and the mean postgraduate year was 5.8 (SD 2.9) and 7.2 (SD 4.6), respectively, with no significant differences. The physicians' diagnostic accuracy was 41.5% in the intervention group and 46% in the control group, with no significant difference (95% CI -0.75 to 2.55; P=.27). The overall accuracy of the trust calibration was only 61.5%, and despite correct calibration, the diagnostic accuracy was 54.5%. In the multivariate logistic regression model, the accuracy of the trust calibration was a significant contributor to the diagnostic accuracy of physicians (adjusted odds ratio 5.90, 95% CI 2.93-12.46; P<.001). The mean confidence level for AI was 72.5% in the intervention group and 45% in the control group, with no significant difference.</p><p><strong>Conclusions: </strong>Trust calibration did not significantly improve physicians' diagnostic accuracy when considering the differential diagnoses generated by reading medical histories and the possible differential diagnosis lists of an AI-driven automated medical history-taking system. As this was a formative study, the small sample size and suboptimal trust calibration methods may have contributed to the lack of significant differences. 
This study highlights the need for a larger sample size and the implementation of supportive measures of trust calibration.</p>","PeriodicalId":14841,"journal":{"name":"JMIR Formative Research","volume":"8 ","pages":"e58666"},"PeriodicalIF":2.0000,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Formative Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/58666","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Diagnostic errors are a significant problem in medical care. Despite the usefulness of artificial intelligence (AI)-based diagnostic decision support systems, physicians' overreliance on AI-generated diagnoses may itself lead to diagnostic errors.

Objective: We investigated whether trust calibration, that is, adjusting physicians' trust levels to match the actual reliability of the AI, supports the safe use of AI-based diagnostic decision support systems.

Methods: A quasi-experimental study was conducted at Dokkyo Medical University, Japan, with physicians allocated 1:1 to an intervention group and a control group. A total of 20 clinical cases were created from the medical histories recorded by an AI-driven automated medical history-taking system for actual patients who visited a community-based hospital in Japan. Participants reviewed each case's medical history alongside an AI-generated list of 10 differential diagnoses and provided 1 to 3 possible diagnoses. In the intervention group, physicians were additionally asked whether they judged the final diagnosis to be included in the AI-generated list of 10 differential diagnoses; this judgment served as the trust calibration. We analyzed the physicians' diagnostic accuracy and the correctness of the trust calibration in the intervention group, as well as the relationship between calibration accuracy, diagnostic accuracy, and the physicians' confidence in using AI.
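
The trust-calibration judgment described in the Methods reduces to a simple binary check. As a minimal sketch, not the authors' actual scoring code and with all names hypothetical, the correctness of a single calibration judgment could be computed as follows:

def calibration_correct(final_diagnosis: str,
                        ai_top10: list[str],
                        physician_says_in_list: bool) -> bool:
    """Return True if the physician's trust judgment matches reality."""
    # The judgment is correct when the physician's yes/no answer agrees
    # with whether the final diagnosis truly appears in the AI's top-10 list.
    truly_in_list = final_diagnosis in ai_top10
    return physician_says_in_list == truly_in_list

# Example: the final diagnosis is on the AI list and the physician said so.
ai_list = ["pneumonia", "pulmonary embolism", "heart failure"]  # 3 of the 10 entries shown
print(calibration_correct("pneumonia", ai_list, physician_says_in_list=True))  # True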

Results: Among the 20 physicians assigned to the intervention (n=10) and control (n=10) groups, the mean age was 30.9 (SD 3.9) years and 31.7 (SD 4.2) years, the proportion of men was 80% and 60%, and the mean postgraduate year was 5.8 (SD 2.9) and 7.2 (SD 4.6), respectively, with no significant differences. The physicians' diagnostic accuracy was 41.5% in the intervention group and 46% in the control group, with no significant difference (95% CI -0.75 to 2.55; P=.27). The overall accuracy of the trust calibration was only 61.5%, and even in cases where the calibration was correct, the diagnostic accuracy was only 54.5%. In the multivariate logistic regression model, the accuracy of the trust calibration was a significant contributor to the physicians' diagnostic accuracy (adjusted odds ratio 5.90, 95% CI 2.93-12.46; P<.001). The mean confidence level in the AI was 72.5% in the intervention group and 45% in the control group, with no significant difference.
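
To make the reported analysis concrete, the following is a minimal sketch, not the authors' analysis code, of a multivariate logistic regression that yields adjusted odds ratios of the kind reported above. The data are randomly simulated, and the column names and the postgraduate-year covariate are illustrative assumptions:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated per-case data: 10 intervention physicians x 20 cases = 200 rows.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "diagnosis_correct": rng.integers(0, 2, n),    # 1 if the physician's diagnosis was correct
    "calibration_correct": rng.integers(0, 2, n),  # 1 if the trust judgment was correct
    "postgraduate_year": rng.integers(3, 12, n),   # hypothetical covariate
})

model = smf.logit("diagnosis_correct ~ calibration_correct + postgraduate_year",
                  data=df).fit(disp=False)

# Exponentiating the coefficients gives adjusted odds ratios with 95% CIs.
odds_ratios = np.exp(model.params).rename("OR")
conf_int = np.exp(model.conf_int()).rename(columns={0: "2.5%", 1: "97.5%"})
print(pd.concat([odds_ratios, conf_int], axis=1))

Because each physician contributes 20 cases, a full analysis would likely also need to account for clustering within physicians (for example, with mixed-effects or GEE models); the sketch above ignores this for brevity.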

Conclusions: Trust calibration did not significantly improve physicians' diagnostic accuracy when they formed differential diagnoses from the medical histories and candidate diagnosis lists produced by an AI-driven automated medical history-taking system. As this was a formative study, the small sample size and a suboptimal trust calibration method may have contributed to the lack of significant differences. This study highlights the need for a larger sample size and for measures that support trust calibration.

Source journal: JMIR Formative Research (Medicine, miscellaneous)
CiteScore: 2.70 | Self-citation rate: 9.10% | Articles published: 579 | Review time: 12 weeks