Diagnostic Accuracy and Clinical Value of a Domain-specific Multimodal Generative AI Model for Chest Radiograph Report Generation.
IF 12.1
CAS Zone 1, Medicine
Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Eun Kyoung Hong, Jiyeon Ham, Byungseok Roh, Jawook Gu, Beomhee Park, Sunghun Kang, Kihyun You, Jihwan Eom, Byeonguk Bae, Jae-Bock Jo, Ok Kyu Song, Woong Bae, Ro Woon Lee, Chong Hyun Suh, Chan Ho Park, Seong Jun Choi, Jai Soung Park, Jae-Hyeong Park, Hyun Jeong Jeon, Jeong-Ho Hong, Dosang Cho, Han Seok Choi, Tae Hee Kim
Radiology. 2025;314(3):e241476. doi: 10.1148/radiol.241476. Published 2025-03-01.
Citations: 0
Abstract
Background: Generative artificial intelligence (AI) is anticipated to alter radiology workflows, requiring a clinical value assessment for frequent examinations such as chest radiograph interpretation.

Purpose: To develop and evaluate the diagnostic accuracy and clinical value of a domain-specific multimodal generative AI model for providing preliminary interpretations of chest radiographs.

Materials and Methods: For training, consecutive radiograph-report pairs from frontal chest radiography were retrospectively collected from 42 hospitals (2005-2023). The trained domain-specific AI model generated radiology reports for the radiographs. The test set included public datasets (PadChest, Open-i, VinDr-CXR, and MIMIC-CXR-JPG) and radiographs excluded from training. The sensitivity and specificity of the model-generated reports for 13 radiographic findings, compared with radiologist annotations (reference standard), were calculated (with 95% CIs). Four radiologists evaluated the subjective quality of the reports in terms of acceptability, agreement score, quality score, and comparative ranking of reports from (a) the domain-specific AI model, (b) radiologists, and (c) a general-purpose large language model (GPT-4Vision). Acceptability was defined as whether the radiologist would endorse the report as their own without changes. Agreement scores from 1 (clinically significant discrepancy) to 5 (complete agreement) were assigned using RADPEER; quality scores were on a 5-point Likert scale from 1 (very poor) to 5 (excellent).

Results: A total of 8 838 719 radiograph-report pairs (training) and 2145 radiographs (testing) were included (anonymized with respect to sex and gender). Reports generated by the domain-specific AI model demonstrated high sensitivity for detecting two critical radiographic findings: 95.3% (181 of 190) for pneumothorax and 92.6% (138 of 149) for subcutaneous emphysema. Acceptance rate, evaluated by four radiologists, was 70.5% (6047 of 8580), 73.3% (6288 of 8580), and 29.6% (2536 of 8580) for model-generated, radiologist, and GPT-4Vision reports, respectively. Agreement scores were highest for the model-generated reports (median = 4 [IQR, 3-5]) and lowest for GPT-4Vision reports (median = 1 [IQR, 1-3]; P < .001). Quality scores were also highest for the model-generated reports (median = 4 [IQR, 3-5]) and lowest for the GPT-4Vision reports (median = 2 [IQR, 1-3]; P < .001). In the ranking analysis, model-generated reports were most frequently ranked highest (60.0%; 5146 of 8580), and GPT-4Vision reports were most frequently ranked lowest (73.6%; 6312 of 8580).

Conclusion: A domain-specific multimodal generative AI model demonstrated potential for high diagnostic accuracy and clinical value in providing preliminary interpretations of chest radiographs for radiologists.

© RSNA, 2025. Supplemental material is available for this article. See also the editorial by Little in this issue.
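The abstract reports sensitivities with 95% CIs from raw counts (e.g., 181 of 190 for pneumothorax) but does not state which interval method the authors used. As an illustration only, the sketch below recomputes those sensitivities and pairs them with a Wilson score interval, a common choice for binomial proportions; the paper's actual method may differ.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    z = 1.96 corresponds to a 95% confidence level.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Sensitivities reported in the abstract, recomputed from the raw counts.
for name, tp, total in [("pneumothorax", 181, 190),
                        ("subcutaneous emphysema", 138, 149)]:
    sens = tp / total
    lo, hi = wilson_ci(tp, total)
    print(f"{name}: sensitivity {sens:.1%} (95% CI: {lo:.1%}, {hi:.1%})")
```

The point estimates match the abstract (95.3% and 92.6%); the interval bounds shown are a property of the Wilson method, not values taken from the paper.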