Maximilian Schmutz, Sebastian Sommer, Julia Sander, David Graumann, Johannes Raffler, Iñaki Soto-Rey, Seyedmostafa Sheikhalishahi, Lisa Schmidt, Leonhard Paul Unkelbach, Levent Ortak, Tina Schaller, Sebastian Dintner, Kathrin Hildebrand, Michaela Kuhlen, Frank Jordan, Martin Trepel, Christian Hinske, Rainer Claus
{"title":"ChatGPT 4.0的大型语言模型处理能力,以生成分子肿瘤委员会建议-对现实世界数据的关键评估。","authors":"Maximilian Schmutz, Sebastian Sommer, Julia Sander, David Graumann, Johannes Raffler, Iñaki Soto-Rey, Seyedmostafa Sheikhalishahi, Lisa Schmidt, Leonhard Paul Unkelbach, Levent Ortak, Tina Schaller, Sebastian Dintner, Kathrin Hildebrand, Michaela Kuhlen, Frank Jordan, Martin Trepel, Christian Hinske, Rainer Claus","doi":"10.1093/oncolo/oyaf293","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) like ChatGPT 4.0 hold promise for enhancing clinical decision-making in precision oncology, particularly within molecular tumor boards (MTBs). This study assesses ChatGPT 4.0's performance in generating therapy recommendations for complex real-world cancer cases compared to expert human MTB (hMTB) teams.</p><p><strong>Methods: </strong>We retrospectively analyzed 20 anonymized MTB cases from the Comprehensive Cancer Center Augsburg (CCCA), covering breast cancer (n = 3), glioblastoma (n = 3), colorectal cancer (n = 2), and rare tumors. ChatGPT 4.0 recommendations were evaluated against hMTB outputs using metrics including recommendation type (therapeutic/diagnostic), information density (IDM), consistency, quality (level of evidence [LoE]), and efficiency. Each case was prompted thrice to evaluate variability (Fleiss' Kappa).</p><p><strong>Results: </strong>ChatGPT 4.0 generated more therapeutic recommendations per case than hMTB (median 3 vs. 1, p = 0.005), with comparable diagnostic suggestions (median 1 vs. 2, p = 0.501). Therapeutic scope from ChatGPT 4.0 included off-label and clinical trial options. IDM scores indicated similar content depth between ChatGPT 4.0 (median 0.67) and hMTB (median 0.75; p = 0.084). Moderate consistency was observed across replicate runs (median Fleiss' Kappa=0.51). ChatGPT 4.0 occasionally utilized lower-level or preclinical evidence more frequently (p = 0.0019). Efficiency favored ChatGPT 4.0 significantly (median 15.2 vs. 
34.7 minutes; p < 0.001).</p><p><strong>Conclusion: </strong>Incorporating ChatGPT 4.0 into MTB workflows enhances efficiency and provides relevant recommendations, especially in guideline-supported cases. However, variability in evidence prioritization highlights the need for ongoing human oversight. A hybrid approach, integrating human expertise with LLM support, may optimize precision oncology decision-making.</p>","PeriodicalId":54686,"journal":{"name":"Oncologist","volume":" ","pages":""},"PeriodicalIF":4.2000,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large language model processing capabilities of ChatGPT 4.0 to generate molecular tumor board recommendations-a critical evaluation on real world data.\",\"authors\":\"Maximilian Schmutz, Sebastian Sommer, Julia Sander, David Graumann, Johannes Raffler, Iñaki Soto-Rey, Seyedmostafa Sheikhalishahi, Lisa Schmidt, Leonhard Paul Unkelbach, Levent Ortak, Tina Schaller, Sebastian Dintner, Kathrin Hildebrand, Michaela Kuhlen, Frank Jordan, Martin Trepel, Christian Hinske, Rainer Claus\",\"doi\":\"10.1093/oncolo/oyaf293\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Large language models (LLMs) like ChatGPT 4.0 hold promise for enhancing clinical decision-making in precision oncology, particularly within molecular tumor boards (MTBs). This study assesses ChatGPT 4.0's performance in generating therapy recommendations for complex real-world cancer cases compared to expert human MTB (hMTB) teams.</p><p><strong>Methods: </strong>We retrospectively analyzed 20 anonymized MTB cases from the Comprehensive Cancer Center Augsburg (CCCA), covering breast cancer (n = 3), glioblastoma (n = 3), colorectal cancer (n = 2), and rare tumors. 
ChatGPT 4.0 recommendations were evaluated against hMTB outputs using metrics including recommendation type (therapeutic/diagnostic), information density (IDM), consistency, quality (level of evidence [LoE]), and efficiency. Each case was prompted thrice to evaluate variability (Fleiss' Kappa).</p><p><strong>Results: </strong>ChatGPT 4.0 generated more therapeutic recommendations per case than hMTB (median 3 vs. 1, p = 0.005), with comparable diagnostic suggestions (median 1 vs. 2, p = 0.501). Therapeutic scope from ChatGPT 4.0 included off-label and clinical trial options. IDM scores indicated similar content depth between ChatGPT 4.0 (median 0.67) and hMTB (median 0.75; p = 0.084). Moderate consistency was observed across replicate runs (median Fleiss' Kappa=0.51). ChatGPT 4.0 occasionally utilized lower-level or preclinical evidence more frequently (p = 0.0019). Efficiency favored ChatGPT 4.0 significantly (median 15.2 vs. 34.7 minutes; p < 0.001).</p><p><strong>Conclusion: </strong>Incorporating ChatGPT 4.0 into MTB workflows enhances efficiency and provides relevant recommendations, especially in guideline-supported cases. However, variability in evidence prioritization highlights the need for ongoing human oversight. 
A hybrid approach, integrating human expertise with LLM support, may optimize precision oncology decision-making.</p>\",\"PeriodicalId\":54686,\"journal\":{\"name\":\"Oncologist\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Oncologist\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1093/oncolo/oyaf293\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ONCOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Oncologist","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1093/oncolo/oyaf293","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ONCOLOGY","Score":null,"Total":0}
Large language model processing capabilities of ChatGPT 4.0 to generate molecular tumor board recommendations-a critical evaluation on real world data.
Background: Large language models (LLMs) like ChatGPT 4.0 hold promise for enhancing clinical decision-making in precision oncology, particularly within molecular tumor boards (MTBs). This study assesses ChatGPT 4.0's performance in generating therapy recommendations for complex real-world cancer cases compared to expert human MTB (hMTB) teams.
Methods: We retrospectively analyzed 20 anonymized MTB cases from the Comprehensive Cancer Center Augsburg (CCCA), covering breast cancer (n = 3), glioblastoma (n = 3), colorectal cancer (n = 2), and rare tumors. ChatGPT 4.0 recommendations were evaluated against hMTB outputs using metrics including recommendation type (therapeutic/diagnostic), information density (IDM), consistency, quality (level of evidence [LoE]), and efficiency. Each case was prompted thrice to evaluate variability (Fleiss' Kappa).
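The replicate-run agreement measure named above, Fleiss' Kappa, can be sketched in a few lines of pure Python. The rating table below is illustrative only (it is not the study's data); rows are cases, columns are rating categories, and each row sums to the number of replicate runs (three, as in the study design):

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a subjects-by-categories count table.

    table[i][j] = number of raters (here, replicate runs) assigning
    subject i to category j; every row must sum to the same rater count n.
    """
    N = len(table)        # number of subjects (cases)
    n = sum(table[0])     # raters per subject (replicate runs)
    k = len(table[0])     # number of categories

    # Mean per-subject agreement: P_i = (sum_j n_ij^2 - n) / (n(n-1))
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in table
    ) / N

    # Chance agreement from the marginal category proportions p_j
    p = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)


# Four hypothetical cases rated across three runs, two categories:
print(fleiss_kappa([[3, 0], [3, 0], [0, 3], [2, 1]]))  # → 0.625
```

Values around 0.41-0.60 are conventionally read as "moderate" agreement, which is the range the study reports (median 0.51).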
Results: ChatGPT 4.0 generated more therapeutic recommendations per case than the hMTB (median 3 vs. 1, p = 0.005), with a comparable number of diagnostic suggestions (median 1 vs. 2, p = 0.501). ChatGPT 4.0's therapeutic scope included off-label and clinical-trial options. IDM scores indicated similar content depth for ChatGPT 4.0 (median 0.67) and the hMTB (median 0.75; p = 0.084). Moderate consistency was observed across replicate runs (median Fleiss' Kappa = 0.51). ChatGPT 4.0 drew on lower-level or preclinical evidence more frequently than the hMTB (p = 0.0019). Efficiency significantly favored ChatGPT 4.0 (median 15.2 vs. 34.7 minutes; p < 0.001).
Conclusion: Incorporating ChatGPT 4.0 into MTB workflows enhances efficiency and provides relevant recommendations, especially in guideline-supported cases. However, variability in evidence prioritization highlights the need for ongoing human oversight. A hybrid approach, integrating human expertise with LLM support, may optimize precision oncology decision-making.
About the journal:
The Oncologist® is dedicated to translating the latest research developments into the best multidimensional care for cancer patients. To that end, The Oncologist is committed to helping physicians excel in this ever-expanding environment through the publication of timely reviews, original studies, and commentaries on important developments. We believe that the practice of oncology requires not only an understanding of a range of disciplines encompassing basic science related to cancer, translational research, and clinical practice, but also of the socioeconomic and psychosocial factors that determine access to care and quality of life and function following cancer treatment.