上下文学习使大型语言模型能够从合成CT和MRI报告中实现脊柱不稳定性肿瘤评分分类的人类水平。

IF 4.8 1区 医学 Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Maximilian F Russe, Marco Reisert, Anna Fink, Marc Hohenhaus, Julia M Nakagawa, Caroline Wilpert, Carl P Simon, Elmar Kotter, Horst Urbach, Alexander Rau
{"title":"上下文学习使大型语言模型能够从合成CT和MRI报告中实现脊柱不稳定性肿瘤评分分类的人类水平。","authors":"Maximilian F Russe, Marco Reisert, Anna Fink, Marc Hohenhaus, Julia M Nakagawa, Caroline Wilpert, Carl P Simon, Elmar Kotter, Horst Urbach, Alexander Rau","doi":"10.1007/s11547-025-02096-7","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To assess the performance of state-of-the-art large language models in classifying vertebral metastasis stability using the Spinal Instability Neoplastic Score (SINS) compared to human experts, and to evaluate the impact of task-specific refinement including in-context learning on their performance.</p><p><strong>Material and methods: </strong>This retrospective study analyzed 100 synthetic CT and MRI reports encompassing a broad range of SINS scores. Four human experts (two radiologists and two neurosurgeons) and four large language models (Mistral, Claude, GPT-4 turbo, and GPT-4o) evaluated the reports. Large language models were tested in both generic form and with task-specific refinement. Performance was assessed based on correct SINS category assignment and attributed SINS points.</p><p><strong>Results: </strong>Human experts demonstrated high median performance in SINS classification (98.5% correct) and points calculation (92% correct), with a median point offset of 0 [0-0]. Generic large language models performed poorly with 26-63% correct category and 4-15% correct SINS points allocation. In-context learning significantly improved chatbot performance to near-human levels (96-98/100 correct for classification, 86-95/100 for scoring, no significant difference to human experts). Refined large language models performed 71-85% better in SINS points allocation.</p><p><strong>Conclusion: </strong>In-context learning enables state-of-the-art large language models to perform at near-human expert levels in SINS classification, offering potential for automating vertebral metastasis stability assessment. The poor performance of generic large language models highlights the importance of task-specific refinement in medical applications of artificial intelligence.</p>","PeriodicalId":20817,"journal":{"name":"Radiologia Medica","volume":" ","pages":""},"PeriodicalIF":4.8000,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"In-context learning enables large language models to achieve human-level performance in spinal instability neoplastic score classification from synthetic CT and MRI reports.\",\"authors\":\"Maximilian F Russe, Marco Reisert, Anna Fink, Marc Hohenhaus, Julia M Nakagawa, Caroline Wilpert, Carl P Simon, Elmar Kotter, Horst Urbach, Alexander Rau\",\"doi\":\"10.1007/s11547-025-02096-7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>To assess the performance of state-of-the-art large language models in classifying vertebral metastasis stability using the Spinal Instability Neoplastic Score (SINS) compared to human experts, and to evaluate the impact of task-specific refinement including in-context learning on their performance.</p><p><strong>Material and methods: </strong>This retrospective study analyzed 100 synthetic CT and MRI reports encompassing a broad range of SINS scores. Four human experts (two radiologists and two neurosurgeons) and four large language models (Mistral, Claude, GPT-4 turbo, and GPT-4o) evaluated the reports. Large language models were tested in both generic form and with task-specific refinement. Performance was assessed based on correct SINS category assignment and attributed SINS points.</p><p><strong>Results: </strong>Human experts demonstrated high median performance in SINS classification (98.5% correct) and points calculation (92% correct), with a median point offset of 0 [0-0]. Generic large language models performed poorly with 26-63% correct category and 4-15% correct SINS points allocation. In-context learning significantly improved chatbot performance to near-human levels (96-98/100 correct for classification, 86-95/100 for scoring, no significant difference to human experts). Refined large language models performed 71-85% better in SINS points allocation.</p><p><strong>Conclusion: </strong>In-context learning enables state-of-the-art large language models to perform at near-human expert levels in SINS classification, offering potential for automating vertebral metastasis stability assessment. The poor performance of generic large language models highlights the importance of task-specific refinement in medical applications of artificial intelligence.</p>\",\"PeriodicalId\":20817,\"journal\":{\"name\":\"Radiologia Medica\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.8000,\"publicationDate\":\"2025-09-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Radiologia Medica\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1007/s11547-025-02096-7\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiologia Medica","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s11547-025-02096-7","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
引用次数: 0

摘要

目的:评估最先进的大型语言模型在使用脊柱不稳定性肿瘤评分(SINS)对椎体转移稳定性进行分类方面的表现,并与人类专家进行比较,并评估包括上下文学习在内的任务特定改进对其表现的影响。材料和方法:本回顾性研究分析了100份综合CT和MRI报告,包括广泛的SINS评分。四名人类专家(两名放射科医生和两名神经外科医生)和四种大型语言模型(Mistral, Claude, GPT-4 turbo和gpt - 40)评估了这些报告。大型语言模型以通用形式和特定于任务的细化进行了测试。性能评估基于正确的SINS类别分配和归属的SINS点。结果:人类专家在SINS分类(98.5%正确率)和点数计算(92%正确率)方面表现出较高的中位数性能,中位数点偏移为0[0-0]。通用的大型语言模型在26-63%的正确率和4-15%的正确率上表现不佳。上下文学习将聊天机器人的性能显著提高到接近人类的水平(分类正确96-98/100,评分正确86-95/100,与人类专家没有显著差异)。改进的大型语言模型在SINS点分配上的性能提高了71-85%。结论:上下文学习使最先进的大型语言模型能够在SINS分类中达到接近人类专家水平,为自动评估椎体转移稳定性提供了潜力。通用大型语言模型的糟糕表现突出了人工智能在医疗应用中特定任务细化的重要性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
In-context learning enables large language models to achieve human-level performance in spinal instability neoplastic score classification from synthetic CT and MRI reports.

Purpose: To assess the performance of state-of-the-art large language models in classifying vertebral metastasis stability using the Spinal Instability Neoplastic Score (SINS) compared to human experts, and to evaluate the impact of task-specific refinement including in-context learning on their performance.

Material and methods: This retrospective study analyzed 100 synthetic CT and MRI reports encompassing a broad range of SINS scores. Four human experts (two radiologists and two neurosurgeons) and four large language models (Mistral, Claude, GPT-4 turbo, and GPT-4o) evaluated the reports. Large language models were tested in both generic form and with task-specific refinement. Performance was assessed based on correct SINS category assignment and attributed SINS points.

Results: Human experts demonstrated high median performance in SINS classification (98.5% correct) and points calculation (92% correct), with a median point offset of 0 [0-0]. Generic large language models performed poorly with 26-63% correct category and 4-15% correct SINS points allocation. In-context learning significantly improved chatbot performance to near-human levels (96-98/100 correct for classification, 86-95/100 for scoring, no significant difference to human experts). Refined large language models performed 71-85% better in SINS points allocation.

Conclusion: In-context learning enables state-of-the-art large language models to perform at near-human expert levels in SINS classification, offering potential for automating vertebral metastasis stability assessment. The poor performance of generic large language models highlights the importance of task-specific refinement in medical applications of artificial intelligence.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Radiologia Medica
Radiologia Medica 医学-核医学
CiteScore
14.10
自引率
7.90%
发文量
133
审稿时长
4-8 weeks
期刊介绍: Felice Perussia founded La radiologia medica in 1914. It is a peer-reviewed journal and serves as the official journal of the Italian Society of Medical and Interventional Radiology (SIRM). The primary purpose of the journal is to disseminate information related to Radiology, especially advancements in diagnostic imaging and related disciplines. La radiologia medica welcomes original research on both fundamental and clinical aspects of modern radiology, with a particular focus on diagnostic and interventional imaging techniques. It also covers topics such as radiotherapy, nuclear medicine, radiobiology, health physics, and artificial intelligence in the context of clinical implications. The journal includes various types of contributions such as original articles, review articles, editorials, short reports, and letters to the editor. With an esteemed Editorial Board and a selection of insightful reports, the journal is an indispensable resource for radiologists and professionals in related fields. Ultimately, La radiologia medica aims to serve as a platform for international collaboration and knowledge sharing within the radiological community.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信