The Impact of Race, Ethnicity, and Sex on Fairness in Artificial Intelligence for Glaucoma Prediction Models

Rohith Ravindranath MS, Joshua D. Stein MD, MS, Tina Hernandez-Boussard, A. Caroline Fisher, Sophia Y. Wang MD, MS

Ophthalmology Science, published August 14, 2024. DOI: 10.1016/j.xops.2024.100596
https://www.sciencedirect.com/science/article/pii/S2666914524001325
Abstract
Objective
Despite advances in artificial intelligence (AI) for glaucoma prediction, most studies lack a multicenter focus and do not consider fairness with respect to sex, race, or ethnicity. This study examines the impact of these sensitive attributes on developing fair AI models that predict glaucoma progression to the point of requiring incisional glaucoma surgery.
Design
Database study.
Participants
Thirty-nine thousand ninety patients with glaucoma, identified by International Classification of Diseases codes, from 7 academic eye centers participating in the Sight OUtcomes Research Collaborative.
Methods
We developed XGBoost models using 3 approaches: (1) excluding sensitive attributes as input features, (2) including them explicitly as input features, and (3) training separate models for each group. Model input features included demographic details, diagnosis codes, medications, and clinical information (e.g., intraocular pressure and visual acuity) from electronic health records. The models were trained on patients from 5 sites (N = 27 999) and evaluated on a held-out internal test set (N = 3499) and 2 external test sets of N = 1550 and N = 2542 patients.
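For concreteness, a minimal sketch of the 3 training approaches is given below. The column names, hyperparameters, and file path are illustrative assumptions; the abstract does not publish the study's exact pipeline, and categorical columns are assumed to be numerically encoded already.

```python
# Sketch of the 3 modeling approaches; names and settings are assumptions.
import pandas as pd
from xgboost import XGBClassifier

SENSITIVE = ["sex", "race", "ethnicity"]   # assumed column names, pre-encoded
TARGET = "progressed_to_surgery"           # assumed binary label column

def fit_model(df: pd.DataFrame, feature_cols: list[str]) -> XGBClassifier:
    """Train one gradient-boosted tree classifier on the given feature subset."""
    model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
    model.fit(df[feature_cols], df[TARGET])
    return model

train = pd.read_parquet("train_cohort.parquet")  # hypothetical EHR-derived table
clinical = [c for c in train.columns if c not in SENSITIVE + [TARGET]]

# Approach 1: sensitive attributes excluded from the inputs.
model_excl = fit_model(train, clinical)

# Approach 2: sensitive attributes included explicitly as inputs.
model_incl = fit_model(train, clinical + SENSITIVE)

# Approach 3: a separate model per subgroup (shown here stratified by sex).
models_by_sex = {grp: fit_model(sub, clinical) for grp, sub in train.groupby("sex")}
```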
Main Outcomes and Measures
Area under the receiver operating characteristic curve (AUROC) and equalized odds, evaluated on the held-out internal test set and the 2 external test sites.
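Equalized odds requires that true-positive and false-positive rates be equal across groups; a common summary is the largest between-group gap in either rate. Below is a minimal sketch under that standard definition, assuming binary labels and predictions as 0/1 NumPy arrays (variable names are illustrative, not from the paper).

```python
# Sketch of an equalized-odds gap: 0 means equalized odds holds exactly.
import numpy as np

def rates(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """True- and false-positive rates of binary 0/1 predictions."""
    tpr = float(np.mean(y_pred[y_true == 1]))  # P(pred = 1 | label = 1)
    fpr = float(np.mean(y_pred[y_true == 0]))  # P(pred = 1 | label = 0)
    return tpr, fpr

def equalized_odds_gap(y_true, y_pred, group) -> float:
    """Largest between-group spread in TPR or FPR across all group values."""
    per_group = [rates(y_true[group == g], y_pred[group == g])
                 for g in np.unique(group)]
    tprs, fprs = zip(*per_group)
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```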
Results
Of the 39 090 patients (mean age, 70.1 years [standard deviation, 14.6]; 54.5% female; 62.3% White; 22.1% Black; 4.7% Latinx/Hispanic), 6682 (17.1%) underwent glaucoma surgery. Excluding the sensitive attributes led to better classification performance (AUROC: 0.77–0.82) but worse fairness on the internal test set. On the external test sites, however, the opposite was true: including sensitive attributes yielded better classification performance (AUROC: external site 1, 0.73–0.81; external site 2, 0.67–0.70), but with varying degrees of fairness for sex and race as measured by equalized odds.
Conclusions
Artificial intelligence models predicting whether patients with glaucoma progress to surgery demonstrated bias with respect to sex, race, and ethnicity. The effect of including or excluding sensitive attributes on fairness and performance varied between internal and external test sets. Before deployment, AI models should be evaluated for fairness in the target population.
Financial Disclosures
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.