Evaluating medical AI systems in dermatology under uncertain ground truth
David Stutz, Ali Taylan Cemgil, Abhijit Guha Roy, Tatiana Matejovicova, Melih Barsbey, Patricia Strachan, Mike Schaekermann, Jan Freyberg, Rajeev Rikhye, Beverly Freeman, Javier Perez Matos, Umesh Telang, Dale R. Webster, Yuan Liu, Greg S. Corrado, Yossi Matias, Pushmeet Kohli, Yun Liu, Arnaud Doucet, Alan Karthikesalingam
Medical Image Analysis, volume 103, Article 103556. Published 2025-04-09. DOI: 10.1016/j.media.2025.103556
https://www.sciencedirect.com/science/article/pii/S1361841525001033
Impact factor: 10.7. JCR: Q1 (Computer Science, Artificial Intelligence). Open access: no.
Citations: 0
Abstract
For safety, medical AI systems undergo thorough evaluations before deployment, validating their predictions against a ground truth which is assumed to be fixed and certain. However, in medical applications, this ground truth is often curated in the form of differential diagnoses provided by multiple experts. While a single differential diagnosis reflects the uncertainty in one expert assessment, multiple experts introduce another layer of uncertainty through potential disagreement. Both forms of uncertainty are ignored in standard evaluation which aggregates these differential diagnoses to a single label. In this paper, we show that ignoring uncertainty leads to overly optimistic estimates of model performance, therefore underestimating risk associated with particular diagnostic decisions. Moreover, point estimates largely ignore dramatic differences in uncertainty of individual cases. To this end, we propose a statistical aggregation approach, where we infer a distribution on probabilities of underlying medical condition candidates themselves, based on observed annotations. This formulation naturally accounts for the potential disagreements between different experts, as well as uncertainty stemming from individual differential diagnoses, capturing the entire ground truth uncertainty. Practically, our approach boils down to generating multiple samples of medical condition probabilities, then evaluating and averaging performance metrics based on these sampled probabilities, instead of relying on a single point estimate. This allows us to provide uncertainty-adjusted estimates of common metrics of interest such as top-k accuracy and average overlap. In the skin condition classification problem of Liu et al. (2020), our methodology reveals significant ground truth uncertainty for most data points and demonstrates that standard evaluation techniques can overestimate performance by several percentage points. We conclude that, while assuming a crisp ground truth may be acceptable for many AI applications, a more nuanced evaluation protocol acknowledging the inherent complexity and variability of differential diagnoses should be utilized in medical diagnosis.
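The sample-then-average procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the inverse-rank weighting, the Dirichlet distribution over condition probabilities, and the `prior` and `reliability` parameters are all assumptions chosen for brevity, and the condition names and annotations are made up.

```python
import random


def rank_weights(differentials, classes):
    """Turn ranked differential diagnoses into per-class pseudo-counts.

    Assumption: each expert gives weight 1/rank to every listed condition,
    normalised so that one expert contributes a total weight of 1.
    """
    counts = {c: 0.0 for c in classes}
    for ranking in differentials:
        if not ranking:
            continue  # an empty differential carries no information
        raw = {c: 1.0 / (pos + 1) for pos, c in enumerate(ranking)}
        total = sum(raw.values())
        for c, w in raw.items():
            counts[c] += w / total
    return counts


def sample_plausibilities(counts, rng, prior=0.1, reliability=5.0):
    """Draw one probability vector from Dirichlet(prior + reliability * counts).

    The Dirichlet is a stand-in (assumption) for a posterior over condition
    probabilities; it is sampled by normalising independent gamma draws.
    """
    gammas = {c: rng.gammavariate(prior + reliability * w, 1.0)
              for c, w in counts.items()}
    total = sum(gammas.values())
    return {c: g / total for c, g in gammas.items()}


def adjusted_top_k_accuracy(counts, prediction, k, n_samples=2000, seed=0):
    """Average top-k correctness over sampled condition probabilities.

    For each sample, the most probable condition is treated as the label,
    and the prediction counts as correct if that label is in its top k.
    """
    rng = random.Random(seed)
    top_k = set(prediction[:k])
    hits = 0
    for _ in range(n_samples):
        p = sample_plausibilities(counts, rng)
        hits += max(p, key=p.get) in top_k
    return hits / n_samples


# Hypothetical case: three experts, three candidate conditions.
classes = ["eczema", "psoriasis", "tinea"]
differentials = [["eczema", "psoriasis"], ["psoriasis", "eczema"], ["eczema"]]
prediction = ["eczema", "tinea", "psoriasis"]  # model's ranked output

counts = rank_weights(differentials, classes)

# Standard evaluation: collapse the annotations to one label, then score.
point_label = max(counts, key=counts.get)
point_top1 = float(point_label in prediction[:1])

# Uncertainty-adjusted evaluation: score against many sampled labels.
adjusted_top1 = adjusted_top_k_accuracy(counts, prediction, k=1)

print("point-estimate top-1:", point_top1)
print("uncertainty-adjusted top-1:", round(adjusted_top1, 3))
```

In this toy case the point estimate scores the prediction as fully correct, while the adjusted estimate is lower because some sampled probability vectors rank psoriasis first. This is the same direction of bias the abstract reports for standard evaluation.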
About the journal:
Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.