Eui Jin Hwang, Jong Hyuk Lee, Woo Hyeon Lim, Won Gi Jeong, Wonju Hong, Jongsoo Park, Seung-Jin Yoo, Hyungjin Kim
{"title":"Clinical Validation of a Generative Artificial Intelligence Model for Chest Radiograph Reporting: A Multicohort Study.","authors":"Eui Jin Hwang, Jong Hyuk Lee, Woo Hyeon Lim, Won Gi Jeong, Wonju Hong, Jongsoo Park, Seung-Jin Yoo, Hyungjin Kim","doi":"10.1148/radiol.250568","DOIUrl":null,"url":null,"abstract":"<p><p>Background Artificial intelligence (AI)-generated radiology reports have become available and require rigorous evaluation. Purpose To evaluate the clinical acceptability of chest radiograph reports generated by an AI algorithm and their accuracy in identifying referable abnormalities. Materials and Methods Chest radiographs from an intensive care unit (ICU), an emergency department, and health checkups were retrospectively collected between January 2020 and December 2022, and outpatient chest radiographs were sourced from a public dataset. An automated report-generating AI algorithm was then applied. A panel of seven thoracic radiologists evaluated the acceptability of generated reports, and acceptability was analyzed using a standard criterion (acceptable without revision or with minor revision) and a stringent criterion (acceptable without revision). Using chest radiographs from three of the contexts (excluding the ICU), AI-generated and radiologist-written reports were compared regarding the acceptability of the reports (generalized linear mixed model) and their sensitivity and specificity for identifying referable abnormalities (McNemar test). The radiologist panel was surveyed to evaluate their perspectives on the potential of AI-generated reports to replace radiologist-written reports. Results The chest radiographs of 1539 individuals (median age, 55 years; 656 male patients, 483 female patients, 400 patients of unknown sex) were included. There was no evidence of a difference in acceptability between AI-generated and radiologist-written reports under the standard criterion (88.4% vs 89.2%; <i>P</i> = .36), but AI-generated reports were less acceptable than radiologist-written reports under the stringent criterion (66.8% vs 75.7%; <i>P</i> < .001). Compared with radiologist-written reports, AI-generated reports identified radiographs with referable abnormalities with greater sensitivity (81.2% vs 59.4%; <i>P</i> < .001) and lower specificity (81.0% vs 93.6%; <i>P</i> < .001). In the survey, most radiologists indicated that AI-generated reports were not yet reliable enough to replace radiologist-written reports. Conclusion AI-generated chest radiograph reports had similar acceptability to radiologist-written reports, although a substantial proportion of AI-generated reports required minor revision. © RSNA, 2025 <i>Supplemental material is available for this article.</i> See also the editorial by Wu and Seo in this issue.</p>","PeriodicalId":20896,"journal":{"name":"Radiology","volume":"316 3","pages":"e250568"},"PeriodicalIF":15.2000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1148/radiol.250568","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Abstract
Background: Artificial intelligence (AI)-generated radiology reports have become available and require rigorous evaluation.

Purpose: To evaluate the clinical acceptability of chest radiograph reports generated by an AI algorithm and their accuracy in identifying referable abnormalities.

Materials and Methods: Chest radiographs from an intensive care unit (ICU), an emergency department, and health checkups were retrospectively collected between January 2020 and December 2022, and outpatient chest radiographs were sourced from a public dataset. An automated report-generating AI algorithm was then applied. A panel of seven thoracic radiologists evaluated the acceptability of the generated reports, which was analyzed under a standard criterion (acceptable without revision or with minor revision) and a stringent criterion (acceptable without revision). Using chest radiographs from three of the contexts (excluding the ICU), AI-generated and radiologist-written reports were compared with respect to report acceptability (generalized linear mixed model) and sensitivity and specificity for identifying referable abnormalities (McNemar test). The radiologist panel was also surveyed on its perspective regarding the potential of AI-generated reports to replace radiologist-written reports.

Results: Chest radiographs of 1539 individuals (median age, 55 years; 656 male patients, 483 female patients, 400 patients of unknown sex) were included. There was no evidence of a difference in acceptability between AI-generated and radiologist-written reports under the standard criterion (88.4% vs 89.2%; P = .36), but AI-generated reports were less acceptable than radiologist-written reports under the stringent criterion (66.8% vs 75.7%; P < .001). Compared with radiologist-written reports, AI-generated reports identified radiographs with referable abnormalities with greater sensitivity (81.2% vs 59.4%; P < .001) and lower specificity (81.0% vs 93.6%; P < .001). In the survey, most radiologists indicated that AI-generated reports were not yet reliable enough to replace radiologist-written reports.

Conclusion: AI-generated chest radiograph reports had acceptability similar to that of radiologist-written reports, although a substantial proportion of AI-generated reports required minor revision.

© RSNA, 2025. Supplemental material is available for this article. See also the editorial by Wu and Seo in this issue.
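Because each radiograph in the comparison cohorts received both an AI-generated and a radiologist-written report, the sensitivity and specificity comparisons rest on paired proportions, which is why the abstract cites the McNemar test: only the discordant pairs (one report type correct, the other not) drive the test statistic. The sketch below is a minimal illustration of that paired analysis, not the authors' code; the per-radiograph hit flags and counts are invented for illustration only, and the `mcnemar` helper from statsmodels is assumed to be available.

```python
# Minimal sketch (illustrative data, not the study's): McNemar test for paired
# sensitivity of AI-generated vs radiologist-written reports on the same radiographs.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical flags for radiographs WITH a referable abnormality:
# 1 = the report flagged the abnormality (true positive), 0 = it missed it.
ai_hit = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1])   # illustrative values only
rad_hit = np.array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1])  # illustrative values only

# Paired 2x2 agreement table between the two report types.
table = np.array([
    [np.sum((ai_hit == 1) & (rad_hit == 1)), np.sum((ai_hit == 1) & (rad_hit == 0))],
    [np.sum((ai_hit == 0) & (rad_hit == 1)), np.sum((ai_hit == 0) & (rad_hit == 0))],
])

print("AI sensitivity:", ai_hit.mean())
print("Radiologist sensitivity:", rad_hit.mean())

# The test uses only the off-diagonal (discordant) cells of the paired table.
result = mcnemar(table, exact=True)
print(f"McNemar statistic = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

The same paired construction applies to specificity, computed over the radiographs without referable abnormalities.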