An objective framework for evaluating unrecognized bias in medical AI models predicting COVID-19 outcomes

Hossein Estiri, Z. Strasser, S. Rashidian, Jeffrey G. Klann, K. Wagholikar, T. McCoy, S. Murphy
{"title":"评估预测COVID-19结果的医疗人工智能模型中未识别偏差的客观框架","authors":"Hossein Estiri, Z. Strasser, S. Rashidian, Jeffrey G. Klann, K. Wagholikar, T. McCoy, S. Murphy","doi":"10.1093/jamia/ocac070","DOIUrl":null,"url":null,"abstract":"Abstract Objective The increasing translation of artificial intelligence (AI)/machine learning (ML) models into clinical practice brings an increased risk of direct harm from modeling bias; however, bias remains incompletely measured in many medical AI applications. This article aims to provide a framework for objective evaluation of medical AI from multiple aspects, focusing on binary classification models. Materials and Methods Using data from over 56 000 Mass General Brigham (MGB) patients with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), we evaluate unrecognized bias in 4 AI models developed during the early months of the pandemic in Boston, Massachusetts that predict risks of hospital admission, ICU admission, mechanical ventilation, and death after a SARS-CoV-2 infection purely based on their pre-infection longitudinal medical records. Models were evaluated both retrospectively and prospectively using model-level metrics of discrimination, accuracy, and reliability, and a novel individual-level metric for error. Results We found inconsistent instances of model-level bias in the prediction models. From an individual-level aspect, however, we found most all models performing with slightly higher error rates for older patients. Discussion While a model can be biased against certain protected groups (ie, perform worse) in certain tasks, it can be at the same time biased towards another protected group (ie, perform better). As such, current bias evaluation studies may lack a full depiction of the variable effects of a model on its subpopulations. Conclusion Only a holistic evaluation, a diligent search for unrecognized bias, can provide enough information for an unbiased judgment of AI bias that can invigorate follow-up investigations on identifying the underlying roots of bias and ultimately make a change.","PeriodicalId":236137,"journal":{"name":"Journal of the American Medical Informatics Association : JAMIA","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"An objective framework for evaluating unrecognized bias in medical AI models predicting COVID-19 outcomes\",\"authors\":\"Hossein Estiri, Z. Strasser, S. Rashidian, Jeffrey G. Klann, K. Wagholikar, T. McCoy, S. Murphy\",\"doi\":\"10.1093/jamia/ocac070\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract Objective The increasing translation of artificial intelligence (AI)/machine learning (ML) models into clinical practice brings an increased risk of direct harm from modeling bias; however, bias remains incompletely measured in many medical AI applications. This article aims to provide a framework for objective evaluation of medical AI from multiple aspects, focusing on binary classification models. 
Materials and Methods Using data from over 56 000 Mass General Brigham (MGB) patients with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), we evaluate unrecognized bias in 4 AI models developed during the early months of the pandemic in Boston, Massachusetts that predict risks of hospital admission, ICU admission, mechanical ventilation, and death after a SARS-CoV-2 infection purely based on their pre-infection longitudinal medical records. Models were evaluated both retrospectively and prospectively using model-level metrics of discrimination, accuracy, and reliability, and a novel individual-level metric for error. Results We found inconsistent instances of model-level bias in the prediction models. From an individual-level aspect, however, we found most all models performing with slightly higher error rates for older patients. Discussion While a model can be biased against certain protected groups (ie, perform worse) in certain tasks, it can be at the same time biased towards another protected group (ie, perform better). As such, current bias evaluation studies may lack a full depiction of the variable effects of a model on its subpopulations. Conclusion Only a holistic evaluation, a diligent search for unrecognized bias, can provide enough information for an unbiased judgment of AI bias that can invigorate follow-up investigations on identifying the underlying roots of bias and ultimately make a change.\",\"PeriodicalId\":236137,\"journal\":{\"name\":\"Journal of the American Medical Informatics Association : JAMIA\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the American Medical Informatics Association : JAMIA\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/jamia/ocac070\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association : JAMIA","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/jamia/ocac070","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13

Abstract

Objective: The increasing translation of artificial intelligence (AI)/machine learning (ML) models into clinical practice brings an increased risk of direct harm from modeling bias; however, bias remains incompletely measured in many medical AI applications. This article aims to provide a framework for objective evaluation of medical AI from multiple aspects, focusing on binary classification models.

Materials and Methods: Using data from over 56 000 Mass General Brigham (MGB) patients with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), we evaluate unrecognized bias in 4 AI models, developed during the early months of the pandemic in Boston, Massachusetts, that predict risks of hospital admission, ICU admission, mechanical ventilation, and death after a SARS-CoV-2 infection, purely based on the patients' pre-infection longitudinal medical records. Models were evaluated both retrospectively and prospectively using model-level metrics of discrimination, accuracy, and reliability, and a novel individual-level metric for error.

Results: We found inconsistent instances of model-level bias in the prediction models. At the individual level, however, we found almost all models performing with slightly higher error rates for older patients.

Discussion: While a model can be biased against certain protected groups (ie, perform worse) in certain tasks, it can at the same time be biased towards another protected group (ie, perform better). As such, current bias evaluation studies may lack a full depiction of the variable effects of a model on its subpopulations.

Conclusion: Only a holistic evaluation, a diligent search for unrecognized bias, can provide enough information for an unbiased judgment of AI bias that can invigorate follow-up investigations on identifying the underlying roots of bias and ultimately make a change.
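The framework's core move is comparing performance within protected subgroups rather than only in aggregate, at both the model level and the individual level. Below is a minimal Python sketch of that idea on synthetic data; the metric choices (AUROC for discrimination, thresholded accuracy, and a simple per-patient absolute error standing in for the paper's individual-level error metric, which this abstract does not specify) and all variable names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of probing a binary classifier
# for subgroup-level bias. Real use would substitute actual model outputs
# for the synthetic y_true / y_prob below.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a held-out cohort: observed outcome (e.g.,
# hospital admission), predicted risk, and a protected attribute.
n = 1000
df = pd.DataFrame({
    "y_true": rng.integers(0, 2, n),
    "y_prob": rng.random(n),
    "age_group": rng.choice(["<50", "50-70", ">70"], n),
})

# Model-level view: discrimination and accuracy within each subgroup.
for group, sub in df.groupby("age_group"):
    auroc = roc_auc_score(sub["y_true"], sub["y_prob"])
    acc = accuracy_score(sub["y_true"], (sub["y_prob"] >= 0.5).astype(int))
    print(f"{group}: AUROC={auroc:.3f} accuracy={acc:.3f}")

# Individual-level view: a per-patient error (here, absolute difference
# between outcome and predicted risk), then compared across subgroups.
df["indiv_error"] = (df["y_true"] - df["y_prob"]).abs()
print(df.groupby("age_group")["indiv_error"].mean())
```

A gap in any one metric across subgroups would flag the kind of bias the paper targets; as the Discussion notes, a model can show a deficit for one group on one metric or task while favoring that same group on another, which is why a single aggregate score is not enough.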