Evaluating medical AI systems in dermatology under uncertain ground truth

IF 10.7 | Region 1: Medicine | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
David Stutz, Ali Taylan Cemgil, Abhijit Guha Roy, Tatiana Matejovicova, Melih Barsbey, Patricia Strachan, Mike Schaekermann, Jan Freyberg, Rajeev Rikhye, Beverly Freeman, Javier Perez Matos, Umesh Telang, Dale R. Webster, Yuan Liu, Greg S. Corrado, Yossi Matias, Pushmeet Kohli, Yun Liu, Arnaud Doucet, Alan Karthikesalingam
Medical Image Analysis, volume 103, Article 103556. Published 2025-04-09. Publication type: Journal Article.
DOI: 10.1016/j.media.2025.103556
https://www.sciencedirect.com/science/article/pii/S1361841525001033
Citations: 0

Abstract

For safety, medical AI systems undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed to be fixed and certain. However, in medical applications, this ground truth is often curated in the form of differential diagnoses provided by multiple experts. While a single differential diagnosis reflects the uncertainty in one expert assessment, multiple experts introduce another layer of uncertainty through potential disagreement. Both forms of uncertainty are ignored in standard evaluation, which aggregates these differential diagnoses to a single label. In this paper, we show that ignoring uncertainty leads to overly optimistic estimates of model performance and therefore underestimates the risk associated with particular diagnostic decisions. Moreover, point estimates largely ignore dramatic differences in the uncertainty of individual cases. To this end, we propose a statistical aggregation approach, where we infer a distribution on probabilities of underlying medical condition candidates themselves, based on observed annotations. This formulation naturally accounts for the potential disagreements between different experts, as well as uncertainty stemming from individual differential diagnoses, capturing the entire ground truth uncertainty. Practically, our approach boils down to generating multiple samples of medical condition probabilities, then evaluating and averaging performance metrics based on these sampled probabilities, instead of relying on a single point estimate. This allows us to provide uncertainty-adjusted estimates of common metrics of interest such as top-k accuracy and average overlap. In the skin condition classification problem of Liu et al. (2020), our methodology reveals significant ground truth uncertainty for most data points and demonstrates that standard evaluation techniques can overestimate performance by several percentage points. We conclude that, while assuming a crisp ground truth may be acceptable for many AI applications, a more nuanced evaluation protocol acknowledging the inherent complexity and variability of differential diagnoses should be utilized in medical diagnosis.
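The sampling idea described in the abstract (draw several plausible condition-probability vectors, score the model against each, then average) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: expert input is simplified to per-condition vote counts, the distribution over condition probabilities is a Dirichlet with a symmetric prior, and the names `sample_dirichlet`, `point_estimate_top_k`, and `uncertainty_adjusted_top_k` are hypothetical. The paper's own statistical aggregation model may differ substantially.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def top_k(scores, k):
    """Return the indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def point_estimate_top_k(model_scores, votes, k):
    """Standard evaluation: collapse expert votes to the single most-voted label."""
    label = max(range(len(votes)), key=lambda i: votes[i])
    return 1.0 if label in top_k(model_scores, k) else 0.0


def uncertainty_adjusted_top_k(model_scores, votes, k,
                               num_samples=1000, prior=1.0, seed=0):
    """Average top-k correctness over sampled condition-probability vectors.

    ASSUMPTION: a Dirichlet(votes + prior) distribution stands in for the
    distribution over condition probabilities; this is an illustrative choice.
    """
    rng = random.Random(seed)
    alpha = [v + prior for v in votes]
    predicted = set(top_k(model_scores, k))
    hits = 0
    for _ in range(num_samples):
        plausibility = sample_dirichlet(alpha, rng)
        # The label for this sample is its most plausible condition.
        label = max(range(len(plausibility)), key=lambda i: plausibility[i])
        if label in predicted:
            hits += 1
    return hits / num_samples


# Hypothetical case: three candidate conditions, five expert votes split 3/2/0.
votes = [3, 2, 0]
model_scores = [0.6, 0.3, 0.1]
print("point estimate:", point_estimate_top_k(model_scores, votes, k=1))
print("uncertainty-adjusted:", uncertainty_adjusted_top_k(model_scores, votes, k=1))
```

In this toy case the point estimate counts the prediction as fully correct because the most-voted condition matches the model's top choice, while the sampled estimate is lower because the 3-to-2 vote split leaves real doubt about which condition is correct. That gap is the kind of over-optimism the abstract attributes to single-label evaluation.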
Source journal
Medical Image Analysis (Engineering and Technology: Biomedical Engineering)
CiteScore: 22.10
Self-citation rate: 6.40%
Articles published: 309
Review time: 6.6 months
About the journal: Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.