In-context learning enables large language models to achieve human-level performance in spinal instability neoplastic score classification from synthetic CT and MRI reports.
IF 4.8 · CAS Tier 1 (Medicine) · JCR Q1 · RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Maximilian F Russe, Marco Reisert, Anna Fink, Marc Hohenhaus, Julia M Nakagawa, Caroline Wilpert, Carl P Simon, Elmar Kotter, Horst Urbach, Alexander Rau
{"title":"In-context learning enables large language models to achieve human-level performance in spinal instability neoplastic score classification from synthetic CT and MRI reports.","authors":"Maximilian F Russe, Marco Reisert, Anna Fink, Marc Hohenhaus, Julia M Nakagawa, Caroline Wilpert, Carl P Simon, Elmar Kotter, Horst Urbach, Alexander Rau","doi":"10.1007/s11547-025-02096-7","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To assess the performance of state-of-the-art large language models in classifying vertebral metastasis stability using the Spinal Instability Neoplastic Score (SINS) compared to human experts, and to evaluate the impact of task-specific refinement including in-context learning on their performance.</p><p><strong>Material and methods: </strong>This retrospective study analyzed 100 synthetic CT and MRI reports encompassing a broad range of SINS scores. Four human experts (two radiologists and two neurosurgeons) and four large language models (Mistral, Claude, GPT-4 turbo, and GPT-4o) evaluated the reports. Large language models were tested in both generic form and with task-specific refinement. Performance was assessed based on correct SINS category assignment and attributed SINS points.</p><p><strong>Results: </strong>Human experts demonstrated high median performance in SINS classification (98.5% correct) and points calculation (92% correct), with a median point offset of 0 [0-0]. Generic large language models performed poorly with 26-63% correct category and 4-15% correct SINS points allocation. In-context learning significantly improved chatbot performance to near-human levels (96-98/100 correct for classification, 86-95/100 for scoring, no significant difference to human experts). Refined large language models performed 71-85% better in SINS points allocation.</p><p><strong>Conclusion: </strong>In-context learning enables state-of-the-art large language models to perform at near-human expert levels in SINS classification, offering potential for automating vertebral metastasis stability assessment. The poor performance of generic large language models highlights the importance of task-specific refinement in medical applications of artificial intelligence.</p>","PeriodicalId":20817,"journal":{"name":"Radiologia Medica","volume":" ","pages":""},"PeriodicalIF":4.8000,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiologia Medica","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s11547-025-02096-7","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Citations: 0
Abstract
Purpose: To assess the performance of state-of-the-art large language models in classifying vertebral metastasis stability using the Spinal Instability Neoplastic Score (SINS) compared to human experts, and to evaluate the impact of task-specific refinement including in-context learning on their performance.
Material and methods: This retrospective study analyzed 100 synthetic CT and MRI reports encompassing a broad range of SINS scores. Four human experts (two radiologists and two neurosurgeons) and four large language models (Mistral, Claude, GPT-4 turbo, and GPT-4o) evaluated the reports. Large language models were tested in both generic form and with task-specific refinement. Performance was assessed based on correct SINS category assignment and attributed SINS points.
Results: Human experts demonstrated high median performance in SINS classification (98.5% correct) and points calculation (92% correct), with a median point offset of 0 [0-0]. Generic large language models performed poorly, with 26-63% correct category assignment and 4-15% correct SINS points allocation. In-context learning significantly improved chatbot performance to near-human levels (96-98/100 correct for classification, 86-95/100 for scoring; no significant difference from human experts). Refined large language models performed 71-85% better in SINS points allocation.
Conclusion: In-context learning enables state-of-the-art large language models to perform at near-human expert levels in SINS classification, offering potential for automating vertebral metastasis stability assessment. The poor performance of generic large language models highlights the importance of task-specific refinement in medical applications of artificial intelligence.
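The abstract does not disclose the exact prompt used for task-specific refinement. As a rough illustration of what in-context learning for this task can look like, the Python sketch below assembles a prompt that pairs the SINS scoring rules with one worked example before presenting a new report. The report text, component values, and the call_llm stub are hypothetical placeholders, not the study's actual prompts, reports, or model interface.

# Minimal sketch (assumptions): build an in-context-learning prompt for SINS
# classification from a radiology report. Example text and the call_llm stub
# are illustrative placeholders, not the study's materials.

# One worked example embedded in the prompt ("in-context learning"): the model
# sees how a report maps to SINS components before scoring a new report.
FEW_SHOT_EXAMPLE = """\
Report: Osteolytic lesion of L1 with >50% vertebral body collapse,
mechanical pain on loading, no posterolateral involvement.
SINS components: location=junctional(3), pain=mechanical(3), lesion=lytic(2),
alignment=normal(0), collapse=>50%(3), posterolateral=none(0)
Total SINS: 11 -> category: indeterminate (potentially unstable)
"""

SCORING_RULES = """\
Assign points for the six SINS components (location, pain, bone lesion quality,
radiographic alignment, vertebral body collapse, posterolateral involvement),
sum them, and map the total to a category:
0-6 stable, 7-12 indeterminate, 13-18 unstable.
"""


def build_prompt(report_text: str) -> str:
    """Assemble a task-refined prompt: rules + worked example + new report."""
    return (
        "You are grading spinal instability using the SINS.\n\n"
        f"{SCORING_RULES}\n"
        "Worked example:\n"
        f"{FEW_SHOT_EXAMPLE}\n"
        "Now score the following report in the same format.\n"
        f"Report: {report_text}\n"
    )


def call_llm(prompt: str) -> str:
    """Placeholder for a call to the chosen model (e.g., GPT-4o, Claude, Mistral)."""
    raise NotImplementedError("Wire this to the selected model's API.")


if __name__ == "__main__":
    sample_report = (
        "Sclerotic metastasis of T7, no pain reported, normal alignment, "
        "no collapse, unilateral posterolateral involvement."
    )
    print(build_prompt(sample_report))

In the study, refinement along these lines lifted the models from 26-63% correct category assignment to 96-98 of 100 reports, which is the gap the conclusion attributes to task-specific refinement.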
Journal overview:
Felice Perussia founded La radiologia medica in 1914. It is a peer-reviewed journal and serves as the official journal of the Italian Society of Medical and Interventional Radiology (SIRM). The primary purpose of the journal is to disseminate information related to Radiology, especially advancements in diagnostic imaging and related disciplines. La radiologia medica welcomes original research on both fundamental and clinical aspects of modern radiology, with a particular focus on diagnostic and interventional imaging techniques. It also covers topics such as radiotherapy, nuclear medicine, radiobiology, health physics, and artificial intelligence in the context of clinical implications. The journal includes various types of contributions such as original articles, review articles, editorials, short reports, and letters to the editor. With an esteemed Editorial Board and a selection of insightful reports, the journal is an indispensable resource for radiologists and professionals in related fields. Ultimately, La radiologia medica aims to serve as a platform for international collaboration and knowledge sharing within the radiological community.