{"title":"大语言模型提供的甲状腺结节癌风险评估和管理建议的适宜性。","authors":"Mohammad Alarifi","doi":"10.1007/s10278-025-01454-1","DOIUrl":null,"url":null,"abstract":"<p><p>The study evaluates the appropriateness and reliability of thyroid nodule cancer risk assessment recommendations provided by large language models (LLMs) ChatGPT, Gemini, and Claude in alignment with clinical guidelines from the American Thyroid Association (ATA) and the National Comprehensive Cancer Network (NCCN). A team comprising a medical imaging informatics specialist and two radiologists developed 24 clinically relevant questions based on ATA and NCCN guidelines. The readability of AI-generated responses was evaluated using the Readability Scoring System. A total of 322 radiologists in training or practice from the United States, recruited via Amazon Mechanical Turk, assessed the AI responses. Quantitative analysis using SPSS measured the appropriateness of recommendations, while qualitative feedback was analyzed through Dedoose. The study compared the performance of three AI models ChatGPT, Gemini, and Claude in providing appropriate recommendations. Paired samples t-tests showed no statistically significant differences in overall performance among the models. Claude achieved the highest mean score (21.84), followed closely by ChatGPT (21.83) and Gemini (21.47). Inappropriate response rates did not differ significantly, though Gemini showed a trend toward higher rates. However, ChatGPT achieved the highest accuracy (92.5%) in providing appropriate responses, followed by Claude (92.1%) and Gemini (90.4%). Qualitative feedback highlighted ChatGPT's clarity and structure, Gemini's accessibility but shallowness, and Claude's organization with occasional divergence from focus. LLMs like ChatGPT, Gemini, and Claude show potential in supporting thyroid nodule cancer risk assessment but require clinical oversight to ensure alignment with guidelines. 
Claude and ChatGPT performed nearly identically overall, with Claude having the highest mean score, though the difference was marginal. Further development is necessary to enhance their reliability for clinical use.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Appropriateness of Thyroid Nodule Cancer Risk Assessment and Management Recommendations Provided by Large Language Models.\",\"authors\":\"Mohammad Alarifi\",\"doi\":\"10.1007/s10278-025-01454-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>The study evaluates the appropriateness and reliability of thyroid nodule cancer risk assessment recommendations provided by large language models (LLMs) ChatGPT, Gemini, and Claude in alignment with clinical guidelines from the American Thyroid Association (ATA) and the National Comprehensive Cancer Network (NCCN). A team comprising a medical imaging informatics specialist and two radiologists developed 24 clinically relevant questions based on ATA and NCCN guidelines. The readability of AI-generated responses was evaluated using the Readability Scoring System. A total of 322 radiologists in training or practice from the United States, recruited via Amazon Mechanical Turk, assessed the AI responses. Quantitative analysis using SPSS measured the appropriateness of recommendations, while qualitative feedback was analyzed through Dedoose. The study compared the performance of three AI models ChatGPT, Gemini, and Claude in providing appropriate recommendations. Paired samples t-tests showed no statistically significant differences in overall performance among the models. Claude achieved the highest mean score (21.84), followed closely by ChatGPT (21.83) and Gemini (21.47). 
Inappropriate response rates did not differ significantly, though Gemini showed a trend toward higher rates. However, ChatGPT achieved the highest accuracy (92.5%) in providing appropriate responses, followed by Claude (92.1%) and Gemini (90.4%). Qualitative feedback highlighted ChatGPT's clarity and structure, Gemini's accessibility but shallowness, and Claude's organization with occasional divergence from focus. LLMs like ChatGPT, Gemini, and Claude show potential in supporting thyroid nodule cancer risk assessment but require clinical oversight to ensure alignment with guidelines. Claude and ChatGPT performed nearly identically overall, with Claude having the highest mean score, though the difference was marginal. Further development is necessary to enhance their reliability for clinical use.</p>\",\"PeriodicalId\":516858,\"journal\":{\"name\":\"Journal of imaging informatics in medicine\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-03-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of imaging informatics in medicine\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s10278-025-01454-1\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of imaging informatics in medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s10278-025-01454-1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Appropriateness of Thyroid Nodule Cancer Risk Assessment and Management Recommendations Provided by Large Language Models.
The study evaluates the appropriateness and reliability of thyroid nodule cancer risk assessment recommendations provided by the large language models (LLMs) ChatGPT, Gemini, and Claude, judged against clinical guidelines from the American Thyroid Association (ATA) and the National Comprehensive Cancer Network (NCCN). A team comprising a medical imaging informatics specialist and two radiologists developed 24 clinically relevant questions based on the ATA and NCCN guidelines. The readability of the AI-generated responses was evaluated using the Readability Scoring System. A total of 322 radiologists in training or practice from the United States, recruited via Amazon Mechanical Turk, assessed the AI responses. Quantitative analysis in SPSS measured the appropriateness of the recommendations, while qualitative feedback was analyzed in Dedoose. The study compared the performance of the three models, ChatGPT, Gemini, and Claude, in providing appropriate recommendations. Paired-samples t-tests showed no statistically significant differences in overall performance among the models. Claude achieved the highest mean score (21.84), followed closely by ChatGPT (21.83) and Gemini (21.47). Inappropriate-response rates did not differ significantly, though Gemini trended toward higher rates. However, ChatGPT achieved the highest accuracy in providing appropriate responses (92.5%), followed by Claude (92.1%) and Gemini (90.4%). Qualitative feedback highlighted ChatGPT's clarity and structure, Gemini's accessibility but relative shallowness, and Claude's strong organization with occasional drift from the question's focus. LLMs such as ChatGPT, Gemini, and Claude show potential for supporting thyroid nodule cancer risk assessment but require clinical oversight to ensure alignment with guidelines. Claude and ChatGPT performed nearly identically overall, with Claude holding a marginally higher mean score. Further development is necessary to enhance their reliability for clinical use.
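As an aside, the paired-samples t-test used to compare the models' mean appropriateness scores can be sketched as follows. This is a minimal illustration of the test statistic only, not the study's analysis: the score lists are hypothetical stand-ins for per-rater scores (the actual data were analyzed in SPSS and are not reproduced here).

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-samples t statistic for two matched lists of rater scores.

    Each position i holds the same rater's score for model x and model y;
    the test works on the per-rater differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by the standard error of the differences
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical per-rater appropriateness scores for two models
# (illustrative values only, chosen so the differences roughly cancel)
model_a = [22, 21, 23, 22, 21, 22]
model_b = [22, 22, 23, 21, 22, 21]

t_stat = paired_t(model_a, model_b)
print(t_stat)  # a t statistic near zero, consistent with "no significant difference"
```

A t statistic this close to zero would, at any conventional sample size, yield a p-value far above 0.05, mirroring the study's finding of no statistically significant difference in overall performance among the models.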