Harmonizing organ-at-risk structure names using open-source large language models

Impact factor 3.3, JCR Q2 (Oncology)
Adrian Thummerer, Matteo Maspero, Erik van der Bijl, Stefanie Corradini, Claus Belka, Guillaume Landry, Christopher Kurz
Physics and Imaging in Radiation Oncology, volume 35, Article 100813. Published 2025-07-01. DOI: 10.1016/j.phro.2025.100813
https://www.sciencedirect.com/science/article/pii/S2405631625001186

Abstract

Background and purpose

Standardized radiotherapy structure nomenclature is crucial for automation, inter-institutional collaborations, and large-scale deep learning studies in radiation oncology. Nomenclature guidelines are available (AAPM-TG-263), but their implementation remains incomplete and still faces challenges. This study evaluated open-source large language models (LLMs) for automated organ-at-risk (OAR) renaming on a multi-institutional and multilingual dataset.

Materials and methods

Four open-source LLMs (Llama 3.3, Llama 3.3 R1, DeepSeek V3, DeepSeek R1) were evaluated using a dataset of 34,177 OAR structures from 1,684 patients collected at three university medical centers with manual TG-263 ground-truth labels. LLM renaming was performed using a few-shot prompting technique, including detailed instructions and generic examples. Performance was assessed by calculating renaming accuracy on the entire dataset and on a unique dataset (duplicates removed). In addition, we performed a failure analysis, prompt-based confidence correlation, and Monte Carlo sampling-based uncertainty estimation.
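
The two accuracy figures described above can be illustrated with a minimal sketch. The abstract does not state how duplicates were defined, so this example assumes a duplicate is a repeated pair of original name and ground-truth label; the function names and the sample records are hypothetical and are not taken from the study.

```python
def renaming_accuracy(records):
    """Fraction of records whose predicted label equals the ground truth.

    records: list of (original_name, predicted_label, true_label) tuples.
    """
    if not records:
        raise ValueError("records must not be empty")
    correct = sum(pred == truth for _, pred, truth in records)
    return correct / len(records)


def deduplicate(records):
    """Keep the first record for each (original_name, true_label) pair.

    Assumption: this is one plausible reading of "duplicates removed".
    """
    seen = {}
    for rec in records:
        seen.setdefault((rec[0], rec[2]), rec)
    return list(seen.values())


# Hypothetical records: a frequent, easy name repeated three times,
# one other correct name, and one left/right mix-up.
records = [
    ("Rueckenmark", "SpinalCord", "SpinalCord"),
    ("Rueckenmark", "SpinalCord", "SpinalCord"),
    ("Rueckenmark", "SpinalCord", "SpinalCord"),
    ("lung L", "Lung_L", "Lung_L"),
    ("Parotis re", "Parotid_L", "Parotid_R"),
]

overall_accuracy = renaming_accuracy(records)              # 4 of 5 correct
unique_accuracy = renaming_accuracy(deduplicate(records))  # 2 of 3 correct
```

Under this assumption, names that recur across many patients weigh heavily in the overall figure, while the unique figure counts each distinct name once, so rare and unusual names have more influence on it.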

Results

High renaming accuracy was achieved, with the reasoning-enhanced DeepSeek R1 model performing best (98.6 % unique accuracy, 99.9 % overall accuracy). Overall, reasoning models outperformed their non-reasoning counterparts. Monte Carlo sampling showed a stronger correlation with prediction errors (correlation coefficient of 0.70 for DeepSeek R1) and better error detection (sensitivity 0.73, specificity 1.0 for DeepSeek R1) than prompt-based confidence estimation (correlation coefficient < 0.42).
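
The abstract does not describe how the Monte Carlo uncertainty was computed. As a rough sketch only, one common approach is to query the model several times with sampling enabled, take the majority answer, and use the disagreement among samples as the uncertainty. Everything below is an assumption for illustration: the function names, the 0.2 threshold, and the sampled outputs are hypothetical, and no real LLM is called.

```python
from collections import Counter


def mc_uncertainty(samples):
    """Return (majority_label, uncertainty) for repeated model outputs.

    Uncertainty is 1 minus the share of samples agreeing with the majority.
    """
    label, count = Counter(samples).most_common(1)[0]
    return label, 1.0 - count / len(samples)


def error_detection(uncertainties, is_error, threshold):
    """Sensitivity and specificity of flagging cases above the threshold."""
    flagged = [u > threshold for u in uncertainties]
    tp = sum(f and e for f, e in zip(flagged, is_error))
    fn = sum((not f) and e for f, e in zip(flagged, is_error))
    tn = sum((not f) and (not e) for f, e in zip(flagged, is_error))
    fp = sum(f and (not e) for f, e in zip(flagged, is_error))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical sampled outputs (5 samples each) with ground-truth labels.
cases = [
    (["SpinalCord"] * 5, "SpinalCord"),
    (["Lung_L"] * 5, "Lung_L"),
    (["Parotid_L", "Parotid_L", "Parotid_R", "Parotid_L", "Parotid_R"], "Parotid_R"),
    (["Eye_L"] * 5, "Eye_R"),  # consistent but wrong: not detectable by disagreement
]

results = [mc_uncertainty(samples) for samples, _ in cases]
uncertainties = [u for _, u in results]
is_error = [label != truth for (label, _), (_, truth) in zip(results, cases)]
sensitivity, specificity = error_detection(uncertainties, is_error, threshold=0.2)
```

In this toy data the flagged case is a true error and no correct prediction is flagged, while the consistently wrong answer is missed. This shows how a method of this kind can reach perfect specificity while its sensitivity stays below 1.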

Conclusions

Open-source LLMs, particularly those with reasoning capabilities, can accurately harmonize OAR nomenclature according to TG-263 across diverse multilingual and multi-institutional datasets. They can also facilitate TG-263 nomenclature adoption and the creation of large, standardized datasets for research and AI development.