Benchmarking a Foundation Large Language Model on its Ability to Relabel Structure Names in Accordance With the American Association of Physicists in Medicine Task Group-263 Report
Jason Holmes PhD , Lian Zhang PhD , Yuzhen Ding PhD , Hongying Feng PhD , Zhengliang Liu MS , Tianming Liu PhD , William W. Wong MD , Sujay A. Vora MD , Jonathan B. Ashman MD, PhD , Wei Liu PhD
Practical Radiation Oncology, Volume 14, Issue 6, Pages e515-e521. Published 2024-11-01. DOI: 10.1016/j.prro.2024.04.017. https://www.sciencedirect.com/science/article/pii/S1879850024000985
Abstract
Purpose
To introduce the concept of using large language models (LLMs) to relabel structure names in accordance with the American Association of Physicists in Medicine Task Group-263 standard and to establish a benchmark for future studies to reference.
Methods and Materials
Generative Pretrained Transformer (GPT)-4 was implemented within a Digital Imaging and Communications in Medicine (DICOM) server. Upon receiving a structure-set DICOM file, the server prompted GPT-4 to relabel the structure names according to the American Association of Physicists in Medicine Task Group-263 report. The results were evaluated for 3 disease sites: prostate, head and neck, and thorax. For each disease site, 150 patients were randomly selected for manually tuning the instruction prompt (in batches of 50), and 50 patients were randomly selected for evaluation. The structure names considered were those most likely to be relevant for studies that use structure contours from many patients.
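The relabeling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the JSON reply format, and the names `build_prompt`, `relabel`, and `fake_llm` are all assumptions, and the GPT-4 call is replaced by a canned stand-in so the sketch runs offline.

```python
import json
from typing import Callable, Dict, List


def build_prompt(site: str, names: List[str]) -> str:
    """Compose an instruction prompt for one patient (hypothetical wording)."""
    listing = "\n".join(f"- {name}" for name in names)
    return (
        "Relabel the following radiotherapy structure names according to the "
        f"AAPM TG-263 report. Disease site: {site}.\n"
        "Reply with a JSON object mapping each original name to its new name.\n"
        f"{listing}"
    )


def relabel(site: str, names: List[str],
            ask_llm: Callable[[str], str]) -> Dict[str, str]:
    """Send the prompt to the model and return an old-name -> new-name map."""
    mapping = json.loads(ask_llm(build_prompt(site, names)))
    # Keep the original name when the model omits it or returns a non-string.
    return {
        name: mapping[name] if isinstance(mapping.get(name), str) else name
        for name in names
    }


def fake_llm(prompt: str) -> str:
    """Stand-in for the GPT-4 call; the reply values are illustrative only."""
    return json.dumps({"bladder": "Bladder", "rectum": "Rectum"})


result = relabel("prostate", ["bladder", "rectum", "Couch"], fake_llm)
print(result)
```

In a real deployment `ask_llm` would wrap the model API and the names would be read from the incoming structure-set file. Falling back to the original name is one conservative design choice, so that a malformed reply never silently drops a structure.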
Results
The per-patient accuracy was 97.2%, 98.3%, and 97.1% for the prostate, head and neck, and thorax disease sites, respectively. On a per-structure basis, the clinical target volume was relabeled correctly in 100%, 95.3%, and 92.9% of cases for the same three sites, respectively.
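The abstract does not spell out how the two accuracy figures are computed. One plausible reading, assumed here, is that per-patient accuracy is the mean over patients of the fraction of that patient's structures relabeled correctly, while per-structure accuracy is the fraction correct for a given structure type across patients. The record layout and toy data below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (patient_id, structure_type, predicted_name, expected_name): assumed layout
Record = Tuple[str, str, str, str]


def per_patient_accuracy(records: List[Record]) -> float:
    """Mean over patients of each patient's fraction of correct relabels."""
    hits = defaultdict(list)
    for patient_id, _, predicted, expected in records:
        hits[patient_id].append(predicted == expected)
    return sum(sum(v) / len(v) for v in hits.values()) / len(hits)


def per_structure_accuracy(records: List[Record]) -> Dict[str, float]:
    """Fraction of correct relabels for each structure type, across patients."""
    hits = defaultdict(list)
    for _, kind, predicted, expected in records:
        hits[kind].append(predicted == expected)
    return {kind: sum(v) / len(v) for kind, v in hits.items()}


# Toy data: patient p1 is fully correct, patient p2 has one wrong CTV label.
records = [
    ("p1", "CTV", "CTV", "CTV"),
    ("p1", "Bladder", "Bladder", "Bladder"),
    ("p2", "CTV", "ctv_final", "CTV"),
    ("p2", "Bladder", "Bladder", "Bladder"),
]
patient_acc = per_patient_accuracy(records)
structure_acc = per_structure_accuracy(records)
print(patient_acc, structure_acc)
```

With this toy data the per-patient figure is the average of 1.0 and 0.5, and the CTV figure is one correct out of two.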
Conclusions
Given the accuracy of GPT-4 in relabeling structure names as presented in this work, LLMs are poised to become an important method for standardizing structure names in radiation oncology, especially considering the rapid advancements in LLM capabilities that are likely to continue.
About the journal:
The overarching mission of Practical Radiation Oncology is to improve the quality of radiation oncology practice. PRO's purpose is to document the state of current practice, providing background for those in training and continuing education for practitioners, through discussion and illustration of new techniques, evaluation of current practices, and publication of case reports. PRO strives to provide its readers content that emphasizes knowledge "with a purpose." The content of PRO includes:
Original articles focusing on patient safety, quality measurement, or quality improvement initiatives
Original articles focusing on imaging, contouring, target delineation, simulation, treatment planning, immobilization, organ motion, and other practical issues
ASTRO guidelines, position papers, and consensus statements
Essays that highlight enriching personal experiences in caring for cancer patients and their families.