Ziman Chen, Nonhlanhla Chambara, Chaoqun Wu, Xina Lo, Shirley Yuk Wah Liu, Simon Takadiyi Gunda, Xinyang Han, Jingguo Qu, Fei Chen, Michael Tin Cheung Ying
Endocrine. Journal Article, published 2024-10-11. DOI: 10.1007/s12020-024-04066-x (https://doi.org/10.1007/s12020-024-04066-x). JCR: Q2 (Endocrinology & Metabolism); impact factor 3.0.
Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images.
Purpose: Large language models (LLMs) are pivotal in artificial intelligence, demonstrating advanced capabilities in natural language understanding and multimodal interactions, with significant potential in medical applications. This study explores the feasibility and efficacy of LLMs, specifically ChatGPT-4o and Claude 3-Opus, in classifying thyroid nodules using ultrasound images.
Methods: This study included 112 patients with a total of 116 thyroid nodules, comprising 75 benign and 41 malignant cases. Ultrasound images of these nodules were analyzed using ChatGPT-4o and Claude 3-Opus to diagnose the benign or malignant nature of the nodules. An independent evaluation by a junior radiologist was also conducted. Diagnostic performance was assessed using Cohen's Kappa and receiver operating characteristic (ROC) curve analysis, referencing pathological diagnoses.
Results: ChatGPT-4o demonstrated poor agreement with pathological results (Kappa = 0.116), while Claude 3-Opus showed even lower agreement (Kappa = 0.034). The junior radiologist exhibited moderate agreement (Kappa = 0.450). ChatGPT-4o achieved an area under the ROC curve (AUC) of 57.0% (95% CI: 48.6-65.5%), slightly outperforming Claude 3-Opus (AUC of 52.0%, 95% CI: 43.2-60.9%). In contrast, the junior radiologist achieved a significantly higher AUC of 72.4% (95% CI: 63.7-81.1%). The unnecessary biopsy rates were 41.4% for ChatGPT-4o, 43.1% for Claude 3-Opus, and 12.1% for the junior radiologist.
Conclusion: While LLMs such as ChatGPT-4o and Claude 3-Opus show promise for future applications in medical imaging, their current use in clinical diagnostics should be approached cautiously due to their limited accuracy.
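The two agreement measures named in the Methods can be illustrated with a minimal Python sketch. It assumes each reader gives a single benign/malignant call per nodule, so the ROC curve has one operating point and its area reduces to the mean of sensitivity and specificity; the abstract does not say whether the authors computed AUC this way. The `ConfusionMatrix` class, the function names, and the counts in `toy` are hypothetical: the counts were chosen only to sum to the study's 41 malignant and 75 benign nodules and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """2x2 table with pathology as the reference standard (hypothetical helper)."""
    tp: int  # malignant on pathology, called malignant
    fn: int  # malignant on pathology, called benign
    fp: int  # benign on pathology, called malignant
    tn: int  # benign on pathology, called benign

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn


def cohen_kappa(cm: ConfusionMatrix) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = cm.total
    observed = (cm.tp + cm.tn) / n
    # Chance agreement from the row and column marginals of the 2x2 table.
    expected = (
        (cm.tp + cm.fp) * (cm.tp + cm.fn)
        + (cm.tn + cm.fn) * (cm.tn + cm.fp)
    ) / n**2
    return (observed - expected) / (1 - expected)


def binary_auc(cm: ConfusionMatrix) -> float:
    """Area under a one-point ROC curve: (sensitivity + specificity) / 2."""
    sensitivity = cm.tp / (cm.tp + cm.fn)
    specificity = cm.tn / (cm.tn + cm.fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts for illustration only (41 malignant, 75 benign).
toy = ConfusionMatrix(tp=30, fn=11, fp=40, tn=35)

if __name__ == "__main__":
    print(f"kappa = {cohen_kappa(toy):.3f}")
    print(f"AUC   = {binary_auc(toy):.3f}")
```

A reader who calls every nodule malignant gets kappa = 0 and AUC = 0.5 under these definitions, which is a useful reference point for the low values reported for the two LLMs.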
About the journal:
Well-established as a major journal in today’s rapidly advancing experimental and clinical research areas, Endocrine publishes original articles devoted to basic (including molecular, cellular and physiological studies), translational and clinical research in all the different fields of endocrinology and metabolism. Articles are accepted on the basis of peer review, priority, and editorial decision. Invited reviews, mini-reviews and viewpoints on relevant pathophysiological and clinical topics, as well as Editorials on articles appearing in the Journal, are published. Unsolicited Editorials will be evaluated by the editorial team. Outcomes of scientific meetings, as well as guidelines and position statements, may be submitted. The Journal also considers special feature articles in the field of endocrine genetics and epigenetics, as well as articles devoted to novel methods and techniques in endocrinology.
Endocrine covers controversial, clinical endocrine issues. Meta-analyses on endocrine and metabolic topics are also accepted. Descriptions of single clinical cases and/or small patient studies are not published unless of exceptional interest. However, reports of novel imaging studies and endocrine side effects in single patients may be considered. Research letters and letters to the editor, whether or not related to recently published articles, can be submitted.
Endocrine covers leading topics in endocrinology such as neuroendocrinology, pituitary and hypothalamic peptides, thyroid physiological and clinical aspects, bone and mineral metabolism and osteoporosis, obesity, lipid and energy metabolism and food intake control, insulin, Type 1 and Type 2 diabetes, hormones of male and female reproduction, adrenal diseases, pediatric and geriatric endocrinology, endocrine hypertension and endocrine oncology.