Rajesh Bhayana, Ankush Jajodia, Tanya Chawla, Yangqing Deng, Genevieve Bouchard-Fortier, Masoom Haider, Satheesh Krishna
Accuracy of Large Language Model-based Automatic Calculation of Ovarian-Adnexal Reporting and Data System MRI Scores from Pelvic MRI Reports
Radiology, volume 315, issue 1, e241554. Published 2025-04-01. DOI: 10.1148/radiol.241554 (https://doi.org/10.1148/radiol.241554)
Abstract
Background: Ovarian-Adnexal Reporting and Data System (O-RADS) for MRI helps assign malignancy risk, but radiologist adoption is inconsistent. Automatic assignment of O-RADS scores from reports could increase adoption and accuracy.

Purpose: To evaluate the accuracy of large language models (LLMs), after strategic optimization, for automatically calculating O-RADS scores from reports.

Materials and Methods: This retrospective single-center study from a large quaternary care cancer center included consecutive gadolinium chelate-enhanced pelvic MRI reports with at least one assigned O-RADS score from July 2021 to October 2023. Reports from January 2018 to October 2019 (before O-RADS MRI implementation) were randomly selected for additional testing. Reference standard O-RADS scores were determined by radiologists interpreting reports. After prompt optimization using a subset of reports, two LLM-based strategies were evaluated: few-shot learning with GPT-4 (version 0613; OpenAI) prompted with O-RADS rules ("LLM only") and a hybrid strategy leveraging GPT-4 to classify features fed into a deterministic formula ("hybrid"). The accuracy of each model and of the originally reported scores was calculated, and accuracies were compared using the McNemar test.

Results: A total of 284 reports from 284 female patients (mean age, 53.2 years ± 16.3 [SD]) with 372 adnexal lesions were included: 10 reports in the training set (16 lesions), 134 reports in internal test set 1 (173 lesions; 158 with an O-RADS score assigned), and 140 reports in internal test set 2 (183 lesions). For assigning O-RADS MRI scores, the hybrid model (accuracy, 97%; 168 of 173) outperformed the LLM-only model (90%; 155 of 173; P = .006). For lesions with an originally reported O-RADS score, hybrid model accuracy exceeded that of the reporting radiologists (97% [153 of 158] vs 88% [139 of 158]; P = .004). The hybrid model also outperformed the LLM-only model for the 183 lesions from before O-RADS implementation (95% [173 of 183] vs 87% [159 of 183], respectively; P = .01).

Conclusion: A hybrid LLM-based application, combining LLM feature classification with deterministic elements, accurately assigned O-RADS MRI scores from report descriptions, exceeding both an LLM-only strategy and the original reporting radiologist.

© RSNA, 2025
Supplemental material is available for this article.
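
The arithmetic behind the reported accuracies can be checked directly from the counts in the abstract. The Python sketch below is illustrative only and is not the authors' code. It recomputes each percentage from its numerator and denominator, confirms that the set sizes sum to the stated totals, and includes a generic exact (binomial) McNemar function. The abstract does not say which McNemar variant was used, and it does not report the discordant-pair counts, so the reported P values cannot be reproduced here; the names `REPORTED`, `accuracy_percent`, and `mcnemar_exact` are assumptions introduced for this example.

```python
from math import comb

# Counts copied from the abstract: (label, correct, total, reported percent).
REPORTED = [
    ("hybrid, test set 1", 168, 173, 97),
    ("LLM only, test set 1", 155, 173, 90),
    ("hybrid, originally scored lesions", 153, 158, 97),
    ("radiologists, originally scored lesions", 139, 158, 88),
    ("hybrid, test set 2", 173, 183, 95),
    ("LLM only, test set 2", 159, 183, 87),
]

# Set sizes from the abstract: training set, test set 1, test set 2.
REPORTS_PER_SET = [10, 134, 140]
LESIONS_PER_SET = [16, 173, 183]


def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy as a percentage of lesions scored correctly."""
    return 100.0 * correct / total


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value.

    b and c are the discordant-pair counts: lesions that method A scored
    correctly and method B did not, and vice versa. Under the null
    hypothesis the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    for label, correct, total, reported in REPORTED:
        pct = accuracy_percent(correct, total)
        print(f"{label}: {correct}/{total} = {pct:.1f}% (reported {reported}%)")
    print("reports:", sum(REPORTS_PER_SET), "lesions:", sum(LESIONS_PER_SET))
```

Because McNemar's test depends only on the lesions where the two methods disagree, the paired per-lesion results (available only to the study authors or in the supplemental material) would be needed to pass real values of `b` and `c` to `mcnemar_exact`.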