Jong Eun Lee, Ki-Seong Park, Yun-Hyeon Kim, Ho-Chun Song, Byunggeon Park, Yeon Joo Jeong
{"title":"使用胸部 CT 和 FDG PET/CT 自由文本报告进行肺癌分期:三种 ChatGPT 大语言模型与六位经验各异的人类读者之间的比较。","authors":"Jong Eun Lee, Ki-Seong Park, Yun-Hyeon Kim, Ho-Chun Song, Byunggeon Park, Yeon Joo Jeong","doi":"10.2214/AJR.24.31696","DOIUrl":null,"url":null,"abstract":"<p><p><b>Background:</b> Although radiology reports are commonly used for lung cancer staging, this task can be challenging given radiologists' variable reporting styles as well as reports' potentially ambiguous and/or incomplete staging-related information. <b>Objective:</b> To compare performance of ChatGPT large-language models (LLMs) and human readers of varying experience in lung cancer staging using chest CT and FDG PET/CT free-text reports. <b>Methods:</b> This retrospective study included 700 patients (mean age, 73.8±29.5 years; 509 male, 191 female) from four institutions in Korea who underwent chest CT or FDG PET/CT for non-small cell lung cancer initial staging from January, 2020 to December, 2023. Examinations' reports used a free-text format, written exclusively in English or in mixed English and Korean. Two thoracic radiologists in consensus determined the overall stage group (IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, IVB) for each report using the AJCC 8th-edition staging system, establishing the reference standard. Three ChatGPT models (GPT-4o, GPT-4, GPT-3.5) determined an overall stage group for each report using a script-based application programming interface, zero-shot learning, and prompt incorporating a staging system summary. Six human readers (two fellowship-trained radiologists with lesser experience than the radiologists who determined the reference standard, two fellows, two residents) also independently determined overall stage groups. GPT-4o's overall accuracy for determining the correct stage among the nine groups was compared with that of the other LLMs and human readers using McNemar tests. <b>Results:</b> GPT-4o had an overall staging accuracy of 74.1%, significantly better than the accuracy of GPT-4 (70.1%, p=.02), GPT-3.5 (57.4%, p<.001), and resident 2 (65.7%, p<.001); significantly worse than the accuracy of fellowship-trained radiologist 1 (82.3%, p<.001) and fellowship-trained radiologist 2 (85.4%, p<.001); and not significantly different from the accuracy of fellow 1 (77.7%, p=.09), fellow 2 (75.6%, p=.53), and resident 1 (72.3%, p=.42). <b>Conclusions:</b> The best-performing model, GPT-4o, showed no significant difference in staging accuracy versus fellows, but significantly worse performance versus fellowship-trained radiologists. The findings do not support use of LLMs for lung cancer staging in place of expert healthcare professionals. 
<b>Clinical Impact:</b> The findings indicate the importance of domain expertise for performing complex specialized tasks such as cancer staging.</p>","PeriodicalId":55529,"journal":{"name":"American Journal of Roentgenology","volume":null,"pages":null},"PeriodicalIF":4.7000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Lung Cancer Staging Using Chest CT and FDG PET/CT Free-Text Reports: Comparison Among Three ChatGPT Large-Language Models and Six Human Readers of Varying Experience.\",\"authors\":\"Jong Eun Lee, Ki-Seong Park, Yun-Hyeon Kim, Ho-Chun Song, Byunggeon Park, Yeon Joo Jeong\",\"doi\":\"10.2214/AJR.24.31696\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Background:</b> Although radiology reports are commonly used for lung cancer staging, this task can be challenging given radiologists' variable reporting styles as well as reports' potentially ambiguous and/or incomplete staging-related information. <b>Objective:</b> To compare performance of ChatGPT large-language models (LLMs) and human readers of varying experience in lung cancer staging using chest CT and FDG PET/CT free-text reports. <b>Methods:</b> This retrospective study included 700 patients (mean age, 73.8±29.5 years; 509 male, 191 female) from four institutions in Korea who underwent chest CT or FDG PET/CT for non-small cell lung cancer initial staging from January, 2020 to December, 2023. Examinations' reports used a free-text format, written exclusively in English or in mixed English and Korean. Two thoracic radiologists in consensus determined the overall stage group (IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, IVB) for each report using the AJCC 8th-edition staging system, establishing the reference standard. Three ChatGPT models (GPT-4o, GPT-4, GPT-3.5) determined an overall stage group for each report using a script-based application programming interface, zero-shot learning, and prompt incorporating a staging system summary. Six human readers (two fellowship-trained radiologists with lesser experience than the radiologists who determined the reference standard, two fellows, two residents) also independently determined overall stage groups. GPT-4o's overall accuracy for determining the correct stage among the nine groups was compared with that of the other LLMs and human readers using McNemar tests. <b>Results:</b> GPT-4o had an overall staging accuracy of 74.1%, significantly better than the accuracy of GPT-4 (70.1%, p=.02), GPT-3.5 (57.4%, p<.001), and resident 2 (65.7%, p<.001); significantly worse than the accuracy of fellowship-trained radiologist 1 (82.3%, p<.001) and fellowship-trained radiologist 2 (85.4%, p<.001); and not significantly different from the accuracy of fellow 1 (77.7%, p=.09), fellow 2 (75.6%, p=.53), and resident 1 (72.3%, p=.42). <b>Conclusions:</b> The best-performing model, GPT-4o, showed no significant difference in staging accuracy versus fellows, but significantly worse performance versus fellowship-trained radiologists. The findings do not support use of LLMs for lung cancer staging in place of expert healthcare professionals. 
<b>Clinical Impact:</b> The findings indicate the importance of domain expertise for performing complex specialized tasks such as cancer staging.</p>\",\"PeriodicalId\":55529,\"journal\":{\"name\":\"American Journal of Roentgenology\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.7000,\"publicationDate\":\"2024-09-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American Journal of Roentgenology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2214/AJR.24.31696\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Journal of Roentgenology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2214/AJR.24.31696","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Lung Cancer Staging Using Chest CT and FDG PET/CT Free-Text Reports: Comparison Among Three ChatGPT Large-Language Models and Six Human Readers of Varying Experience.
Background: Although radiology reports are commonly used for lung cancer staging, the task can be challenging given radiologists' variable reporting styles and reports' potentially ambiguous and/or incomplete staging-related information.

Objective: To compare the performance of ChatGPT large language models (LLMs) and human readers of varying experience in lung cancer staging using chest CT and FDG PET/CT free-text reports.

Methods: This retrospective study included 700 patients (mean age, 73.8±29.5 years; 509 male, 191 female) from four institutions in Korea who underwent chest CT or FDG PET/CT for initial staging of non-small cell lung cancer from January 2020 to December 2023. Examination reports used a free-text format and were written exclusively in English or in mixed English and Korean. Two thoracic radiologists, in consensus, determined the overall stage group (IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, IVB) for each report using the AJCC 8th-edition staging system, establishing the reference standard. Three ChatGPT models (GPT-4o, GPT-4, GPT-3.5) determined an overall stage group for each report using a script-based application programming interface, zero-shot learning, and a prompt incorporating a staging-system summary. Six human readers (two fellowship-trained radiologists with less experience than the radiologists who determined the reference standard, two fellows, and two residents) also independently determined overall stage groups. GPT-4o's overall accuracy for determining the correct stage among the nine groups was compared with that of the other LLMs and the human readers using McNemar tests.

Results: GPT-4o had an overall staging accuracy of 74.1%: significantly better than GPT-4 (70.1%, p=.02), GPT-3.5 (57.4%, p<.001), and resident 2 (65.7%, p<.001); significantly worse than fellowship-trained radiologist 1 (82.3%, p<.001) and fellowship-trained radiologist 2 (85.4%, p<.001); and not significantly different from fellow 1 (77.7%, p=.09), fellow 2 (75.6%, p=.53), and resident 1 (72.3%, p=.42).

Conclusions: The best-performing model, GPT-4o, showed no significant difference in staging accuracy versus fellows but significantly worse performance versus fellowship-trained radiologists. The findings do not support use of LLMs for lung cancer staging in place of expert healthcare professionals.

Clinical Impact: The findings indicate the importance of domain expertise for performing complex specialized tasks such as cancer staging.
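The Methods describe a script-based API workflow: each free-text report is sent to the model with a zero-shot prompt that embeds a staging-system summary and asks for one of the nine stage groups. The study's actual script and prompt are not published with the abstract; the sketch below shows what such a call could look like with the openai Python client (v1.x). The function name stage_report, the STAGING_SUMMARY text, and the prompt wording are illustrative assumptions, not the authors' materials.

```python
# Minimal sketch of a zero-shot staging call; NOT the authors' actual script.
# Assumes the openai Python client (v1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical condensed AJCC 8th-edition summary; the paper's exact
# staging-system summary is not published with the abstract.
STAGING_SUMMARY = (
    "AJCC 8th edition NSCLC overall stage groups and their defining "
    "T/N/M combinations: IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, IVB. ..."
)

STAGE_GROUPS = ["IA", "IB", "IIA", "IIB", "IIIA", "IIIB", "IIIC", "IVA", "IVB"]

def stage_report(report_text: str, model: str = "gpt-4o") -> str:
    """Ask the model for a single overall stage group, zero-shot."""
    response = client.chat.completions.create(
        model=model,       # "gpt-4", "gpt-3.5-turbo" for the other comparisons
        temperature=0,     # as deterministic as possible for classification
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a thoracic radiologist. Using the staging summary "
                    f"below, assign one overall stage group.\n{STAGING_SUMMARY}\n"
                    "Answer with exactly one of: " + ", ".join(STAGE_GROUPS)
                ),
            },
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.strip()
```

Temperature 0 and a constrained answer format are common choices for classification-style prompting; whether the study used them is not stated in the abstract.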
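The accuracy comparisons are paired (every model and reader stages the same 700 reports), which is why McNemar tests are appropriate rather than tests for independent proportions. The following sketch, assuming hypothetical per-report 0/1 correctness vectors and the statsmodels mcnemar function, illustrates the comparison; compare_readers is a made-up helper and the simulated data are illustrative only.

```python
# Sketch of the paired accuracy comparison described in the abstract,
# on hypothetical 0/1 correctness vectors (1 = stage matched the reference).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_readers(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """McNemar test on the paired per-report correctness of two readers."""
    # 2x2 agreement/disagreement table; only the discordant cells drive the test.
    both    = np.sum((correct_a == 1) & (correct_b == 1))
    a_only  = np.sum((correct_a == 1) & (correct_b == 0))
    b_only  = np.sum((correct_a == 0) & (correct_b == 1))
    neither = np.sum((correct_a == 0) & (correct_b == 0))
    table = [[both, a_only], [b_only, neither]]
    result = mcnemar(table, exact=False, correction=True)  # chi-square form
    return result.pvalue

# Illustrative run with simulated data for 700 reports. The vectors are drawn
# independently here, so the pairing structure of the real data is not modeled.
rng = np.random.default_rng(0)
gpt4o   = rng.binomial(1, 0.741, 700)  # ~74.1% accuracy, as reported
fellow1 = rng.binomial(1, 0.777, 700)  # ~77.7% accuracy, as reported
print(f"p = {compare_readers(gpt4o, fellow1):.3f}")
```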
About the Journal:
Founded in 1907, the monthly American Journal of Roentgenology (AJR) is the world's longest continuously published general radiology journal. AJR is recognized as among the specialty's leading peer-reviewed journals and has a worldwide circulation of close to 25,000. The journal publishes clinically oriented articles across all radiology subspecialties, emphasizing relevance to radiologists' daily practice. The journal publishes hundreds of articles annually in a diverse range of formats, including original research, reviews, clinical perspectives, editorials, and other short reports. The journal engages its audience through a spectrum of social media and digital communication activities.