Confirmation of Large Language Models in Head and Neck Cancer Staging.
Authors: Mehmet Kayaalp, Hatice Bölek, Hatime Arzu Yaşar
Journal: Diagnostics, 15(18), published 2025-09-18
DOI: https://doi.org/10.3390/diagnostics15182375
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12468830/pdf/
Impact factor: 3.3; JCR: Q1 (Medicine, General & Internal)
Background/Objectives: Head and neck cancer (HNC) is a heterogeneous group of malignancies in which staging plays a critical role in guiding treatment and prognosis. Large language models (LLMs) such as ChatGPT, DeepSeek, and Grok have emerged as potential tools in oncology, yet their clinical applicability in staging remains unclear. This study aimed to evaluate the accuracy of LLM-generated staging and its concordance with clinician-assigned staging in patients with HNC.
Methods: The medical records of 202 patients with HNC who presented to our center between 1 January 2010 and 13 February 2025 were retrospectively reviewed. A junior researcher extracted the information from the hospital information system, and a senior researcher re-evaluated it and completed standard staging. The data used for staging, excluding the stage itself, were provided to a blinded third researcher, who entered them into the ChatGPT, DeepSeek, and Grok applications with a staging command. After all staging was completed, the data were compiled, and clinician-assigned stages were compared with those generated by the LLMs.
Results: The most common diagnoses were laryngeal (45.5%) and nasopharyngeal cancer (21.3%). Definitive surgery was performed in 39.6% of the patients. Stage 4 was the most common stage (54%). The overall concordance rates, Cohen's kappa values, and F1 scores were 85.6%, 0.797, and 0.84 for ChatGPT; 67.3%, 0.522, and 0.65 for DeepSeek; and 75.2%, 0.614, and 0.72 for Grok, respectively, with no statistically significant differences between models. Pathological and surgical staging were found to be similar in terms of concordance. Concordance was also compared across assessments that used only imaging, only pathology notes, only physical examination notes, and comprehensive information, with no significant differences found.
Conclusions: LLMs demonstrate relatively high accuracy in staging HNC. With careful implementation and with the consideration of prospective studies, these models have the potential to become valuable tools in oncology practice.
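The three agreement metrics reported in the Results (concordance rate, Cohen's kappa, and F1 score) can be illustrated with a minimal sketch. This is not the authors' analysis code: the stage labels below are hypothetical toy data, the function names are made up for illustration, and the F1 score is assumed to be macro-averaged across stages, which the abstract does not specify.

```python
from collections import Counter


def concordance(reference, predicted):
    """Fraction of cases in which the two stage labels agree."""
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def cohen_kappa(reference, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(reference)
    p_observed = concordance(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    labels = set(ref_counts) | set(pred_counts)
    # Chance agreement from the marginal label frequencies of each rater.
    p_expected = sum(ref_counts[l] * pred_counts[l] for l in labels) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def macro_f1(reference, predicted):
    """Unweighted mean of per-stage F1 scores (assumed averaging scheme)."""
    labels = sorted(set(reference) | set(predicted))
    pairs = list(zip(reference, predicted))
    scores = []
    for label in labels:
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: clinician-assigned vs. LLM-assigned stages.
clinician = ["I", "II", "III", "IV", "IV", "IV", "II", "III", "IV", "I"]
model = ["I", "II", "IV", "IV", "IV", "IV", "II", "III", "III", "I"]

print(f"concordance: {concordance(clinician, model):.3f}")
print(f"Cohen's kappa: {cohen_kappa(clinician, model):.3f}")
print(f"macro F1: {macro_f1(clinician, model):.3f}")
```

Kappa comes out lower than raw concordance because it discounts agreement expected by chance, which matters when one stage dominates the cohort, as Stage 4 does here.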
Diagnostics
Biochemistry, Genetics and Molecular Biology - Clinical Biochemistry
CiteScore: 4.70
Self-citation rate: 8.30%
Articles published: 2699
Review time: 19.64 days
About the journal:
Diagnostics (ISSN 2075-4418) is an international scholarly open access journal on medical diagnostics. It publishes original research articles, reviews, communications and short notes on the research and development of medical diagnostics. There is no restriction on the length of the papers. Our aim is to encourage scientists to publish their experimental and theoretical research in as much detail as possible. Full experimental and/or methodological details must be provided for research articles.