Özge Baş Aksu, Rıfat Furkan Aydın, Asena Gökçay Canpolat, Özgür Demir, Mustafa Şahin, Rıfat Emral, Sevim Güllü
Artificial intelligence in endocrine practice: comparing ChatGPT, Gemini, and Claude for adrenal incidentaloma care.
Journal of Endocrinological Investigation (Impact Factor 3.5, JCR Q1, Medicine). Published 2025-10-07. DOI: 10.1007/s40618-025-02715-0 (https://doi.org/10.1007/s40618-025-02715-0)
Citations: 0
Abstract
Purpose: The clinical use of artificial intelligence (AI) is expanding in endocrinology, yet the performance of large language models (LLMs) in managing adrenal incidentalomas remains uncertain. This study aimed to compare the performance of four LLMs (ChatGPT-4o, ChatGPT-o1, Google Gemini 2.0, and Claude 3.5) on guideline-based queries and clinical scenarios involving adrenal incidentalomas.
Methods: In this cross-sectional study, 34 guideline-derived questions and four case scenarios were presented to the LLMs, covering diagnosis, treatment and follow-up, patient questions, and clinical cases. Six endocrinologists evaluated responses using Likert scales assessing hallucination tendency, quality, usability, reliability, and accuracy. Readability metrics and word counts were also analyzed.
Results: No significant differences were found between models in diagnosis (p = 0.86-0.72), treatment and follow-up (p = 0.46-0.10), and patient question (p = 0.78-0.10) categories. However, in complex cases, ChatGPT-4o outperformed ChatGPT-o1 with higher scores in hallucination control (6.5 ± 0.8 vs. 4.8 ± 0.8), quality (6.2 ± 0.8 vs. 5.0 ± 0.6), and usability (4.5 ± 0.8 vs. 3.3 ± 0.5) (all p < 0.05). Readability analysis revealed high text complexity (Flesch-Kincaid Grade Level: 10.6-17.4), and inter-rater reliability was excellent (intraclass correlation coefficient: 0.876-0.961, p < 0.001).
Conclusion: LLMs show potential as decision-support tools in adrenal incidentaloma management. While their performance is comparable in routine tasks, significant differences arise in complex cases, highlighting the need for model selection, human oversight, and attention to readability in endocrine practice.
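For readers unfamiliar with the readability metric cited in the Results, the following is a minimal sketch of how a Flesch-Kincaid Grade Level can be computed, using the standard formula 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The abstract does not state which tool the authors used; the function names and the vowel-group syllable heuristic below are assumptions made for illustration, and dedicated readability software will usually give slightly different values.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not the authors' method):
    count groups of consecutive vowels and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level from the standard published formula."""
    # Split on sentence-ending punctuation; ignore empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Treat runs of letters (with an optional apostrophe part) as words.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    simple = "The cat sat on the mat."
    dense = "Adrenal incidentalomas require comprehensive biochemical evaluation."
    print(round(flesch_kincaid_grade(simple), 2))
    print(round(flesch_kincaid_grade(dense), 2))
```

A grade level in the reported range of 10.6 to 17.4 corresponds roughly to text written for readers between the later years of secondary school and postgraduate level, which is why the abstract describes the responses as highly complex.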
About the journal:
The Journal of Endocrinological Investigation is a well-established, e-only endocrine journal founded in 1978. It is the official journal of the Italian Society of Endocrinology (SIE), established in 1964. Other Italian societies in the endocrinology and metabolism field are affiliated with the journal: Italian Society of Andrology and Sexual Medicine, Italian Society of Obesity, Italian Society of Pediatric Endocrinology and Diabetology, Clinical Endocrinologists’ Association, Thyroid Association, Endocrine Surgical Units Association, Italian Society of Pharmacology.