Karen Drukker, Samuel G Armato, Lubomir Hadjiiski, Judy Gichoya, Nicholas Gruszauskas, Jayashree Kalpathy-Cramer, Hui Li, Kyle J Myers, Robert M Tomek, Heather M Whitney, Zi Zhang, Maryellen L Giger
Journal of Medical Imaging, vol. 12, no. 5, 054502. Published 2025-09-01 (Epub 2025-10-07). DOI: 10.1117/1.JMI.12.5.054502 (https://doi.org/10.1117/1.JMI.12.5.054502). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12503059/pdf/
Machine learning evaluation of pneumonia severity: subgroup performance in the Medical Imaging and Data Resource Center modified radiographic assessment of lung edema mastermind challenge.
Purpose: The Medical Imaging and Data Resource Center Mastermind Grand Challenge of modified radiographic assessment of lung edema (mRALE) tasked participants with developing machine learning techniques for automated COVID-19 severity assessment via mRALE scores on portable chest radiographs (CXRs). We examine potential biases across demographic subgroups for the best-performing models of the nine teams participating in the test phase of the challenge.
Approach: Models were evaluated against a nonpublic test set of CXRs (814 patients) annotated by radiologists for disease severity (mRALE score 0 to 24). Participants used a variety of data and methods for training. Performance was measured using quadratic-weighted kappa (QWK). Bias analyses considered demographics (sex, age, race, ethnicity, and their intersections) using QWK. In addition, for distinguishing no/mild versus moderate/severe disease, equal opportunity difference (EOD) and average absolute odds difference (AAOD) were calculated. Bias was defined as statistically significant QWK subgroup differences, or EOD outside [-0.1; 0.1], or AAOD outside [0; 0.1].
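For readers unfamiliar with the agreement metric named in the Approach, the following is a minimal sketch of the standard quadratic-weighted kappa computation, not the challenge's own evaluation code. It assumes integer scores on the 0 to 24 mRALE scale (25 categories); the function name and argument names are hypothetical.

```python
def quadratic_weighted_kappa(reference, predicted, num_classes=25):
    """Standard QWK between two equal-length lists of integer scores.

    Assumption: scores are integers in [0, num_classes - 1]; 25 classes
    corresponds to the 0-24 mRALE range mentioned in the abstract.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)

    # Observed confusion matrix: rows = reference score, columns = predicted score.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for r, p in zip(reference, predicted):
        observed[r][p] += 1

    # Marginal histograms used to build the chance-agreement matrix.
    ref_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic penalty: grows with the squared score difference.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * ref_hist[i] * pred_hist[j] / n

    if weighted_expected == 0:
        # Both score lists are constant and identical; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement with the reference standard, 0 indicates chance-level agreement, and negative values indicate systematic disagreement. The quadratic weights penalize large score discrepancies more heavily than near misses, which suits an ordinal scale such as mRALE.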
Results: The nine models demonstrated good agreement with the reference standard (QWK 0.74 to 0.88). The winning model (QWK = 0.884 [0.819; 0.949]) was the only model without biases identified in terms of QWK. The runner-up model (QWK = 0.874 [0.813; 0.936]) showed no identified biases in terms of EOD and AAOD, whereas the winning model disadvantaged three subgroups in each of these metrics. The median number of disadvantaged subgroups for all models was 3.
Conclusions: The challenge demonstrated strong model performances but identified subgroup disparities. Bias analysis is essential as models with similar accuracy may exhibit varying fairness.
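The two binary fairness metrics and the bias thresholds stated in the Approach can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' analysis code: it uses the common definitions (EOD as the difference in true-positive rates, AAOD as the mean of the absolute differences in false-positive and true-positive rates), assumes moderate/severe disease is coded as 1, and assumes the subgroup-minus-reference sign convention. The abstract does not specify the reference group, the sign convention, or the mRALE cutoff, so the inputs here are already-binarized labels and all function names and toy data are hypothetical.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true-positive rate, false-positive rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("each group needs at least one positive and one negative case")
    return tp / (tp + fn), fp / (fp + tn)


def fairness_metrics(subgroup, reference_group):
    """Compute (EOD, AAOD, flagged) for a subgroup against a reference group.

    Each argument is a (y_true, y_pred) pair of 0/1 lists.
    Assumed sign convention: subgroup rate minus reference-group rate.
    """
    tpr_s, fpr_s = tpr_fpr(*subgroup)
    tpr_r, fpr_r = tpr_fpr(*reference_group)
    eod = tpr_s - tpr_r
    aaod = 0.5 * (abs(fpr_s - fpr_r) + abs(tpr_s - tpr_r))
    # Thresholds from the abstract: EOD outside [-0.1; 0.1] or AAOD outside [0; 0.1].
    flagged = abs(eod) > 0.1 or aaod > 0.1
    return eod, aaod, flagged


# Toy, made-up labels for illustration only (not data from the challenge).
reference_group = ([1, 1, 0, 0], [1, 1, 0, 0])
subgroup = ([1, 1, 0, 0], [1, 0, 1, 0])
eod, aaod, flagged = fairness_metrics(subgroup, reference_group)
```

Because AAOD is non-negative by construction, only its upper bound needs checking. The QWK-based bias criterion in the abstract relies on statistical significance testing and is not covered by this sketch.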
About the journal:
JMI covers fundamental and translational research, as well as applications, focused on medical imaging, which continue to yield physical and biomedical advancements in the early detection, diagnostics, and therapy of disease as well as in the understanding of normal. The scope of JMI includes: Imaging physics, Tomographic reconstruction algorithms (such as those in CT and MRI), Image processing and deep learning, Computer-aided diagnosis and quantitative image analysis, Visualization and modeling, Picture archiving and communications systems (PACS), Image perception and observer performance, Technology assessment, Ultrasonic imaging, Image-guided procedures, Digital pathology, Biomedical applications of biomedical imaging. JMI allows for the peer-reviewed communication and archiving of scientific developments, translational and clinical applications, reviews, and recommendations for the field.