Is a score enough? Pitfalls and solutions for AI severity scores.
Michael H Bernstein, Marly van Assen, Michael A Bruno, Elizabeth A Krupinski, Carlo De Cecco, Grayson L Baird
European Radiology Experimental 9(1):67, published 2025-07-14. DOI: 10.1186/s41747-025-00603-z. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12259500/pdf/
Severity scores, which often refer to the likelihood or probability of a pathology, are commonly provided by artificial intelligence (AI) tools in radiology. However, little attention has been given to the use of these AI scores, and there is a lack of transparency into how they are generated. In this comment, we draw on key principles from psychological science and statistics to elucidate six human factors limitations of AI scores that undermine their utility: (1) variability across AI systems; (2) variability within AI systems; (3) variability between radiologists; (4) variability within radiologists; (5) unknown distribution of AI scores; and (6) perceptual challenges. We hypothesize that these limitations can be mitigated by providing the false discovery rate and false omission rate for each score as a threshold. We discuss how this hypothesis could be empirically tested.
KEY POINTS:
The radiologist-AI interaction has not been given sufficient attention.
The utility of AI scores is limited by six key human factors limitations.
We propose a hypothesis for how to mitigate these limitations by using false discovery rate and false omission rate.
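To illustrate the two metrics the authors propose reporting alongside each score, the following is a minimal sketch (not part of the published abstract) of how the false discovery rate (FDR) and false omission rate (FOR) could be computed when an AI score is treated as a binary flag at a given threshold. The function name, example scores, labels, and the threshold of 70 are all hypothetical, chosen only for illustration.

import numpy as np

def fdr_for_at_threshold(scores, labels, threshold):
    """Compute FDR and FOR when cases with score >= threshold are flagged.
    labels: 1 = pathology present, 0 = pathology absent."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    flagged = scores >= threshold                     # cases the AI flags as positive
    tp = np.sum(flagged & (labels == 1))
    fp = np.sum(flagged & (labels == 0))
    fn = np.sum(~flagged & (labels == 1))
    tn = np.sum(~flagged & (labels == 0))
    fdr = fp / (fp + tp) if (fp + tp) else float("nan")      # share of flagged cases that are false alarms
    fo_rate = fn / (fn + tn) if (fn + tn) else float("nan")  # share of unflagged cases that are missed pathology
    return fdr, fo_rate

# Hypothetical example: what do flags at a threshold of 70 actually mean?
fdr, fo_rate = fdr_for_at_threshold(
    scores=[12, 55, 71, 88, 93, 40, 66, 97],
    labels=[0, 0, 0, 1, 1, 0, 1, 1],
    threshold=70,
)
print(f"FDR at threshold 70: {fdr:.2f}; FOR at threshold 70: {fo_rate:.2f}")

In this toy example, both rates come out to 0.25: one in four flagged cases is a false alarm, and one in four unflagged cases is a miss. Reporting these two quantities per threshold, rather than the raw score alone, is the kind of contextual information the comment argues would mitigate the listed human factors limitations.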