Harmonizing organ-at-risk structure names using open-source large language models

Adrian Thummerer, Matteo Maspero, Erik van der Bijl, Stefanie Corradini, Claus Belka, Guillaume Landry, Christopher Kurz

Physics and Imaging in Radiation Oncology, Volume 35, Article 100813 (July 2025). DOI: 10.1016/j.phro.2025.100813
Abstract
Background and purpose
Standardized radiotherapy structure nomenclature is crucial for automation, inter-institutional collaboration, and large-scale deep learning studies in radiation oncology. Despite the availability of nomenclature guidelines (AAPM TG-263), their adoption remains inconsistent and faces practical challenges. This study evaluated open-source large language models (LLMs) for automated organ-at-risk (OAR) renaming on a multi-institutional and multilingual dataset.
Materials and methods
Four open-source LLMs (Llama 3.3, Llama 3.3 R1, DeepSeek V3, DeepSeek R1) were evaluated using a dataset of 34,177 OAR structures from 1684 patients collected at three university medical centers with manual TG-263 ground-truth labels. LLM renaming was performed using a few-shot prompting technique, including detailed instructions and generic examples. Performance was assessed by calculating renaming accuracy on the entire dataset and a unique dataset (duplicates removed). In addition, we performed a failure analysis, prompt-based confidence correlation, and Monte Carlo sampling-based uncertainty estimation.
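The few-shot prompting approach can be illustrated with a minimal sketch. This is not the authors' actual prompt; the instruction wording and the example pairs below are hypothetical, chosen only to show how detailed instructions and generic input/output examples might be assembled ahead of the query structure name.

```python
# Hypothetical few-shot prompt for TG-263 OAR renaming (illustrative only).
FEW_SHOT_EXAMPLES = [
    ("rueckenmark", "SpinalCord"),  # German source name
    ("parotid l", "Parotid_L"),
    ("heart", "Heart"),
]

def build_prompt(structure_name: str) -> str:
    """Assemble instructions, generic examples, and the query structure."""
    lines = [
        "Rename the given organ-at-risk structure to its TG-263 standard name.",
        "Answer with the TG-263 name only.",
        "",
        "Examples:",
    ]
    for raw, tg263 in FEW_SHOT_EXAMPLES:
        lines.append(f"Input: {raw} -> Output: {tg263}")
    lines.append("")
    lines.append(f"Input: {structure_name} -> Output:")
    return "\n".join(lines)

prompt = build_prompt("coeur")  # French source name for 'heart'
```

The completed prompt would then be sent to one of the evaluated LLMs, whose single-token-style answer is compared against the manual ground-truth label.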
Results
High renaming accuracy was achieved, with the reasoning-enhanced DeepSeek R1 model performing best (98.6% unique accuracy, 99.9% overall accuracy). Overall, reasoning models outperformed their non-reasoning counterparts. Monte Carlo sampling showed a stronger correlation with prediction errors (correlation coefficient of 0.70 for DeepSeek R1) and better error detection (sensitivity 0.73, specificity 1.0 for DeepSeek R1) compared to prompt-based confidence estimation (correlation coefficient < 0.42).
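The Monte Carlo sampling idea can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the model is queried repeatedly with stochastic decoding, and the fraction of samples that agree with the majority answer serves as a confidence score; low agreement flags likely errors. The `toy_sampler` below is a stand-in for a real LLM call.

```python
# Illustrative Monte Carlo confidence estimation (not the authors' code).
import itertools
from collections import Counter

def mc_confidence(sample_model, structure_name: str, n_samples: int = 10):
    """Query the model n_samples times; return (majority answer,
    agreement fraction) as an uncertainty-aware prediction."""
    answers = [sample_model(structure_name) for _ in range(n_samples)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n_samples

# Toy sampler standing in for a stochastic LLM: 9 of 10 draws agree.
_draws = itertools.cycle(["SpinalCord"] * 9 + ["Cord"])
def toy_sampler(name: str) -> str:
    return next(_draws)

answer, confidence = mc_confidence(toy_sampler, "rueckenmark", n_samples=10)
# answer == "SpinalCord", confidence == 0.9
```

A prediction with agreement below some threshold could then be routed to manual review, which matches the reported trade-off of high specificity with moderate sensitivity for error detection.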
Conclusions
Open-source LLMs, particularly those with reasoning capabilities, can accurately harmonize OAR nomenclature according to TG-263 across diverse multilingual and multi-institutional datasets. They can also facilitate TG-263 nomenclature adoption and the creation of large, standardized datasets for research and AI development.