Accuracy and reproducibility of large language model measurements of liver metastases: comparison with radiologist measurements.

IF 2.1; Zone 4, Medicine

Japanese Journal of Radiology, Pub Date: 2025-10-04, DOI: 10.1007/s11604-025-01884-5

Haruto Sugawara, Akiyo Takada, Shimpei Kato


Citations: 0

Abstract

Purpose: To compare the accuracy and reproducibility of lesion-diameter measurements made by three state-of-the-art large language models (LLMs) with those made by radiologists.

Materials and methods: In this retrospective study using a public database, 83 patients with solitary colorectal-cancer liver metastases were identified. From each CT series, a radiologist extracted the single axial slice showing the maximal tumor diameter and converted it to a 512 × 512-pixel PNG image (window level 50 HU, window width 400 HU), with the pixel size encoded in the filename. Three LLMs, ChatGPT-o3 (OpenAI), Gemini 2.5 Pro (Google), and Claude 4 Opus (Anthropic), were each prompted twice, ≥ 1 week apart, to estimate the longest lesion diameter. Two board-certified radiologists (12 years' experience each) independently measured the same single-slice images, and one radiologist repeated the measurements after ≥ 1 week. Agreement was assessed with intraclass correlation coefficients (ICC); 95% confidence intervals were obtained by bootstrap resampling (5,000 iterations).
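The image preparation and agreement statistics described in the methods can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not state which ICC form or which bootstrap interval was used, so the sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single measurement) and a percentile bootstrap over patients. The function names `apply_window`, `icc_2_1`, and `bootstrap_ci` are hypothetical.

```python
import numpy as np


def apply_window(hu, level=50.0, width=400.0):
    """Map Hounsfield units to 8-bit grey levels (WL 50 HU, WW 400 HU by default)."""
    lo, hi = level - width / 2.0, level + width / 2.0
    scaled = (np.clip(hu, lo, hi) - lo) / (hi - lo)
    return np.round(scaled * 255.0).astype(np.uint8)


def icc_2_1(ratings):
    """ICC(2,1) for an (n_subjects, k_raters) array.

    Assumption: two-way random effects, absolute agreement, single
    measurement. The abstract does not say which ICC form was used.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)  # between subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)  # between raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def bootstrap_ci(ratings, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ICC(2,1), resampling subjects with replacement.

    Assumption: percentile interval; the abstract only says "bootstrap
    resampling (5,000 iterations)".
    """
    x = np.asarray(ratings, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample whole patients (rows)
        stats[b] = icc_2_1(x[idx])
    lower, upper = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lower), float(upper)
```

For a model-versus-radiologist comparison, `ratings` would hold one row per patient and two columns, one for the model's diameter and one for the radiologist's. Resampling rows rather than individual values keeps each patient's paired measurements together.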

Results: Radiologist inter-observer agreement was excellent (ICC = 0.95, 95% CI 0.86-0.99), and intra-observer agreement was ICC = 0.98 (95% CI 0.94-0.99). Gemini achieved good model-to-radiologist agreement (ICC = 0.81, 95% CI 0.68-0.89) and good intra-model reproducibility (ICC = 0.78, 95% CI 0.65-0.87). ChatGPT-o3 showed moderate agreement (ICC = 0.52) and poor reproducibility (ICC = 0.25); Claude showed poor agreement (ICC = 0.07) and poor reproducibility (ICC = 0.47).
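The qualitative labels attached to the ICC values are consistent with one widely used convention (below 0.50 poor, 0.50 to 0.75 moderate, 0.75 to 0.90 good, above 0.90 excellent). The abstract does not say which scale the authors applied, so the thresholds in this sketch are an assumption, and `icc_label` is a hypothetical helper name.

```python
def icc_label(icc):
    """Map an ICC value to a qualitative label.

    Assumed thresholds (not stated in the abstract):
    < 0.50 poor, 0.50-0.75 moderate, 0.75-0.90 good, > 0.90 excellent.
    """
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


# ICC point estimates as reported in the Results paragraph.
reported = {
    "radiologist inter-observer": 0.95,
    "Gemini agreement": 0.81,
    "Gemini reproducibility": 0.78,
    "GPT-o3 agreement": 0.52,
    "GPT-o3 reproducibility": 0.25,
    "Claude agreement": 0.07,
    "Claude reproducibility": 0.47,
}

labels = {name: icc_label(value) for name, value in reported.items()}
```

Under these assumed cut-offs, every label the abstract gives for a point estimate is reproduced, including Claude's reproducibility of 0.47, which falls just below the boundary for moderate.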

Conclusion: LLMs do not yet match radiologists in measuring colorectal cancer liver metastases; however, Gemini's good agreement and reproducibility highlight the rapid progress in the image-interpretation capabilities of LLMs.




Source journal

Japanese Journal of Radiology (Medicine: Radiology, Nuclear Medicine and Imaging)

Self-citation rate: 4.80%

Articles published: 133

About the journal: Japanese Journal of Radiology is a peer-reviewed journal, officially published by the Japan Radiological Society. The main purpose of the journal is to provide a forum for the publication of papers documenting recent advances and new developments in the field of radiology in medicine and biology. The scope of Japanese Journal of Radiology encompasses but is not restricted to diagnostic radiology, interventional radiology, radiation oncology, nuclear medicine, radiation physics, and radiation biology. Additionally, the journal covers technical and industrial innovations. The journal welcomes original articles, technical notes, review articles, pictorial essays and letters to the editor. The journal also provides announcements from the boards and the committees of the society. Membership in the Japan Radiological Society is not a prerequisite for submission. Contributions are welcomed from all parts of the world.