LLM-powered breast cancer staging from PET/CT reports: a comparative performance study

IF 4.1, Zone 2 (Medicine), JCR Q2: Computer Science, Information Systems

International Journal of Medical Informatics, Volume 204, Article 106053. Pub Date: 2025-07-19. DOI: 10.1016/j.ijmedinf.2025.106053

Daniel Spitzl, Markus Mergen, Rickmer Braren, Lukas Endrös, Matthias Eiber, Lisa Steinhelfer


Citations: 0

Abstract

Purpose

Imaging reports are crucial in breast cancer management, with the tumor-node-metastasis (TNM) classification serving as a widely used model for assessing disease severity, guiding treatment decisions, and predicting patient outcomes. Large language models (LLMs) could support this process by extracting standardized UICC TNM classifications and the corresponding UICC stage directly from existing PET/CT reports. This approach may enhance staging accuracy, streamline multidisciplinary discussions, and improve patient outcomes.

Methods

Here, we evaluated four LLMs—ChatGPT-4o, DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash—for their capacity to determine TNM staging based on UICC/AJCC breast cancer guidelines. A total of 111 fictitious PET/CT reports were analyzed, and each model’s outputs were measured against expert-generated TNM classifications and stage categorizations.
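The comparison of model outputs against expert labels described above can be illustrated with a minimal sketch. The regular expression, the helper names `parse_tnm` and `component_agreement`, and the example strings are all hypothetical; the abstract does not describe how the authors parsed or scored the model outputs, and the pattern covers only simple notations such as "T2 N1 M0".

```python
import re

# Hypothetical pattern for simple TNM strings such as "T2 N1 M0" or "cT1c N0 M0".
# Real reports use many more variants; this is an illustrative assumption.
TNM_PATTERN = re.compile(
    r"T(is|X|[0-4][a-d]?)\s*,?\s*N(X|[0-3][a-c]?)\s*,?\s*M(X|[01])"
)


def parse_tnm(text):
    """Return the first (T, N, M) triple found in text, or None if none is found."""
    match = TNM_PATTERN.search(text)
    if match is None:
        return None
    t, n, m = match.groups()
    return f"T{t}", f"N{n}", f"M{m}"


def component_agreement(predictions, references):
    """Fraction of reports whose T, N, and M labels match the expert labels exactly.

    predictions: raw model output strings, one per report.
    references:  expert (T, N, M) tuples, one per report.
    An unparseable model output counts as wrong for all three components.
    """
    counts = {"T": 0, "N": 0, "M": 0}
    for pred_text, ref in zip(predictions, references):
        parsed = parse_tnm(pred_text)
        if parsed is None:
            continue
        for key, pred_label, ref_label in zip("TNM", parsed, ref):
            counts[key] += pred_label == ref_label
    total = len(references)
    return {key: counts[key] / total for key in counts}


# Made-up example data, not taken from the study.
predictions = ["Staging: T2 N1 M0", "cT1c N0 M0", "no staging possible"]
references = [("T2", "N1", "M0"), ("T1c", "N1", "M0"), ("T3", "N0", "M0")]
agreement = component_agreement(predictions, references)
```

Treating unparseable outputs as errors is one possible design choice; a study could equally report them as a separate category.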

Results

Among the tested models, Claude 3.5 Sonnet achieved the highest F1 scores: 0.95, 0.95, and 1.00 for T, N, and M classification, respectively, and 0.92 for UICC stage classification.
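For readers unfamiliar with the metric, F1 scores like those reported above can be computed from paired expert and model labels as in the sketch below. The abstract does not state which averaging scheme was used, so the macro average here is an assumption, and the function names and example labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Made-up M-category labels, not taken from the study.
y_true = ["M0", "M0", "M1", "M1"]
y_pred = ["M0", "M1", "M1", "M1"]
scores = per_class_f1(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

In this toy example one M0 report is misclassified as M1, which lowers recall for M0 and precision for M1. F1 is a value between 0 and 1, so a score of 1.00 corresponds to no misclassified reports for that category.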

Conclusions

These findings underscore the ability of advanced natural language processing (NLP) technologies to support reliable cancer staging, potentially aiding clinicians. Despite the encouraging performance, prospective clinical trials and validation across diverse practice settings remain critical to confirming these preliminary outcomes. Nonetheless, this study highlights the promise of LLM-based systems in reinforcing the accuracy of oncologic workflows and lays the groundwork for broader adoption of AI-driven tools in breast cancer management.




Source journal

International Journal of Medical Informatics (Medicine, Computer Science: Information Systems)

CiteScore

8.90

Self-citation rate

4.10%

Articles published

217

Review time

42 days

About the journal: International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings. The scope of the journal covers: Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration etc.; Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.; Educational computer based programs pertaining to medical informatics or medicine in general; Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.