Appropriateness of Thyroid Nodule Cancer Risk Assessment and Management Recommendations Provided by Large Language Models

Mohammad Alarifi
DOI: 10.1007/s10278-025-01454-1
Journal: Journal of imaging informatics in medicine
Published: 2025-03-03 (Journal Article)
Citations: 0

Abstract


The study evaluates the appropriateness and reliability of thyroid nodule cancer risk assessment recommendations provided by the large language models (LLMs) ChatGPT, Gemini, and Claude, judged against clinical guidelines from the American Thyroid Association (ATA) and the National Comprehensive Cancer Network (NCCN). A team comprising a medical imaging informatics specialist and two radiologists developed 24 clinically relevant questions based on ATA and NCCN guidelines. The readability of AI-generated responses was evaluated using the Readability Scoring System. A total of 322 radiologists in training or practice from the United States, recruited via Amazon Mechanical Turk, assessed the AI responses. Quantitative analysis using SPSS measured the appropriateness of recommendations, while qualitative feedback was analyzed through Dedoose. The study compared the performance of the three AI models (ChatGPT, Gemini, and Claude) in providing appropriate recommendations. Paired samples t-tests showed no statistically significant differences in overall performance among the models. Claude achieved the highest mean score (21.84), followed closely by ChatGPT (21.83) and Gemini (21.47). Inappropriate response rates did not differ significantly, though Gemini showed a trend toward higher rates. However, ChatGPT achieved the highest accuracy (92.5%) in providing appropriate responses, followed by Claude (92.1%) and Gemini (90.4%). Qualitative feedback highlighted ChatGPT's clarity and structure, Gemini's accessibility but shallowness, and Claude's organization, with occasional drift from the question's focus. LLMs like ChatGPT, Gemini, and Claude show potential in supporting thyroid nodule cancer risk assessment but require clinical oversight to ensure alignment with guidelines. Claude and ChatGPT performed nearly identically overall, with Claude having the highest mean score, though the difference was marginal. Further development is necessary to enhance their reliability for clinical use.
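The paired-samples t-test used to compare overall model performance can be sketched as follows. This is only an illustrative sketch: the per-rater score arrays below are simulated placeholders (the study's raw rating data are not reproduced in the abstract), chosen to mimic the reported near-identical mean scores.

```python
# Illustrative paired-samples t-test, as described in the abstract.
# NOTE: the score arrays are hypothetical; the study's per-rater data are not public.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical appropriateness scores from the same 322 raters for two models.
chatgpt_scores = rng.normal(loc=21.83, scale=2.0, size=322)
claude_scores = chatgpt_scores + rng.normal(loc=0.01, scale=1.5, size=322)

# Paired design: each rater scored both models, so ttest_rel (not ttest_ind).
t_stat, p_value = stats.ttest_rel(claude_scores, chatgpt_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A p-value above the usual 0.05 threshold would indicate no statistically
# significant difference in overall performance, as the study reports.
```

The paired (rather than independent-samples) test is the appropriate choice here because each rater evaluated all three models, so scores are correlated within raters.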
