Enhancing the CAD-RADS™ 2.0 Category Assignment Performance of ChatGPT and DeepSeek Through "Few-shot" Prompting.

IF 1.3 · Medicine, Quartile 4 · JCR Q4 · RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Hasan Emin Kaya
{"title":"通过“Few-shot”提示提高ChatGPT和DeepSeek的CAD-RADS™2.0分类分配性能。","authors":"Hasan Emin Kaya","doi":"10.1097/RCT.0000000000001802","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>To assess whether few-shot prompting improves the performance of 2 popular large language models (LLMs) (ChatGPT o1 and DeepSeek-R1) in assigning Coronary Artery Disease Reporting and Data System (CAD-RADS™ 2.0) categories.</p><p><strong>Methods: </strong>A detailed few-shot prompt based on CAD-RADS™ 2.0 framework was developed using 20 reports from the MIMIC-IV database. Subsequently, 100 modified reports from the same database were categorized using zero-shot and few-shot prompts through the models' user interface. Model accuracy was evaluated by comparing assignments to a reference radiologist's classifications, including stenosis categories and modifiers. To assess reproducibility, 50 reports were reclassified using the same few-shot prompt. McNemar tests and Cohen kappa were used for statistical analysis.</p><p><strong>Results: </strong>Using zero-shot prompting, accuracy was low for both models (ChatGPT: 14%, DeepSeek: 8%), with correct assignments occurring almost exclusively in CAD-RADS 0 cases. Hallucinations occurred frequently (ChatGPT: 19%, DeepSeek: 54%). Few-shot prompting significantly improved accuracy to 98% for ChatGPT and 93% for DeepSeek (both P<0.001) and eliminated hallucinations. Kappa values for agreement between model-generated and radiologist-assigned classifications were 0.979 (0.950, 1.000) (P<0.001) for ChatGPT and 0.916 (0.859, 0.973) (P<0.001) for DeepSeek, indicating almost perfect agreement for both models without a significant difference between the models (P=0.180). Reproducibility analysis yielded kappa values of 0.957 (0.900, 1.000) (P<0.001) for ChatGPT and 0.873 [0.779, 0.967] (P<0.001) for DeepSeek, indicating almost perfect and strong agreement between repeated assignments, respectively, with no significant difference between the models (P=0.125).</p><p><strong>Conclusion: </strong>Few-shot prompting substantially enhances LLMs' accuracy in assigning CAD-RADS™ 2.0 categories, suggesting potential for clinical application and facilitating system adoption.</p>","PeriodicalId":15402,"journal":{"name":"Journal of Computer Assisted Tomography","volume":" ","pages":""},"PeriodicalIF":1.3000,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing the CAD-RADS™ 2.0 Category Assignment Performance of ChatGPT and DeepSeek Through \\\"Few-shot\\\" Prompting.\",\"authors\":\"Hasan Emin Kaya\",\"doi\":\"10.1097/RCT.0000000000001802\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>To assess whether few-shot prompting improves the performance of 2 popular large language models (LLMs) (ChatGPT o1 and DeepSeek-R1) in assigning Coronary Artery Disease Reporting and Data System (CAD-RADS™ 2.0) categories.</p><p><strong>Methods: </strong>A detailed few-shot prompt based on CAD-RADS™ 2.0 framework was developed using 20 reports from the MIMIC-IV database. Subsequently, 100 modified reports from the same database were categorized using zero-shot and few-shot prompts through the models' user interface. Model accuracy was evaluated by comparing assignments to a reference radiologist's classifications, including stenosis categories and modifiers. 
To assess reproducibility, 50 reports were reclassified using the same few-shot prompt. McNemar tests and Cohen kappa were used for statistical analysis.</p><p><strong>Results: </strong>Using zero-shot prompting, accuracy was low for both models (ChatGPT: 14%, DeepSeek: 8%), with correct assignments occurring almost exclusively in CAD-RADS 0 cases. Hallucinations occurred frequently (ChatGPT: 19%, DeepSeek: 54%). Few-shot prompting significantly improved accuracy to 98% for ChatGPT and 93% for DeepSeek (both P<0.001) and eliminated hallucinations. Kappa values for agreement between model-generated and radiologist-assigned classifications were 0.979 (0.950, 1.000) (P<0.001) for ChatGPT and 0.916 (0.859, 0.973) (P<0.001) for DeepSeek, indicating almost perfect agreement for both models without a significant difference between the models (P=0.180). Reproducibility analysis yielded kappa values of 0.957 (0.900, 1.000) (P<0.001) for ChatGPT and 0.873 [0.779, 0.967] (P<0.001) for DeepSeek, indicating almost perfect and strong agreement between repeated assignments, respectively, with no significant difference between the models (P=0.125).</p><p><strong>Conclusion: </strong>Few-shot prompting substantially enhances LLMs' accuracy in assigning CAD-RADS™ 2.0 categories, suggesting potential for clinical application and facilitating system adoption.</p>\",\"PeriodicalId\":15402,\"journal\":{\"name\":\"Journal of Computer Assisted Tomography\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2025-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computer Assisted Tomography\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/RCT.0000000000001802\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Assisted Tomography","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/RCT.0000000000001802","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Citations: 0

Abstract


Objective: To assess whether few-shot prompting improves the performance of two popular large language models (LLMs), ChatGPT o1 and DeepSeek-R1, in assigning Coronary Artery Disease Reporting and Data System (CAD-RADS™ 2.0) categories.

Methods: A detailed few-shot prompt based on the CAD-RADS™ 2.0 framework was developed using 20 reports from the MIMIC-IV database. Subsequently, 100 modified reports from the same database were categorized using zero-shot and few-shot prompts through the models' user interface. Model accuracy was evaluated by comparing assignments to a reference radiologist's classifications, including stenosis categories and modifiers. To assess reproducibility, 50 reports were reclassified using the same few-shot prompt. McNemar tests and Cohen's kappa were used for statistical analysis.
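The abstract does not reproduce the study's prompt, so the following is only a minimal sketch of how a few-shot prompt for CAD-RADS™ 2.0 assignment might be assembled; the instruction text, example reports, and category labels below are illustrative placeholders, not the study's materials.

```python
# A minimal sketch of a few-shot prompt for CAD-RADS 2.0 assignment.
# All instructions, example reports, and labels are hypothetical
# placeholders; the study's actual prompt is not given in the abstract.

FEW_SHOT_EXAMPLES = [
    # (report excerpt, expected CAD-RADS 2.0 assignment) -- illustrative only
    ("No coronary plaque or stenosis identified.", "CAD-RADS 0"),
    ("50-69% stenosis of the proximal LAD; mild total plaque burden.",
     "CAD-RADS 3 / P1"),
]

INSTRUCTIONS = (
    "You are assisting with coronary CT angiography reporting. "
    "Assign a CAD-RADS 2.0 category, including plaque burden and any "
    "applicable modifiers, to the report below. Answer with the category only."
)

def build_prompt(report_text: str) -> str:
    """Concatenate the instructions, worked examples, and the new report."""
    parts = [INSTRUCTIONS, ""]
    for example_report, label in FEW_SHOT_EXAMPLES:
        parts += [f"Report: {example_report}", f"Category: {label}", ""]
    parts += [f"Report: {report_text}", "Category:"]
    return "\n".join(parts)

if __name__ == "__main__":
    print(build_prompt("Mixed plaque causing 25-49% stenosis of the RCA."))
```

A zero-shot prompt would be the same request with the worked examples omitted, which is the comparison the study makes.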

Results: Using zero-shot prompting, accuracy was low for both models (ChatGPT: 14%, DeepSeek: 8%), with correct assignments occurring almost exclusively in CAD-RADS 0 cases. Hallucinations occurred frequently (ChatGPT: 19%, DeepSeek: 54%). Few-shot prompting significantly improved accuracy to 98% for ChatGPT and 93% for DeepSeek (both P<0.001) and eliminated hallucinations. Kappa values for agreement between model-generated and radiologist-assigned classifications were 0.979 (0.950, 1.000) (P<0.001) for ChatGPT and 0.916 (0.859, 0.973) (P<0.001) for DeepSeek, indicating almost perfect agreement for both models without a significant difference between the models (P=0.180). Reproducibility analysis yielded kappa values of 0.957 (0.900, 1.000) (P<0.001) for ChatGPT and 0.873 (0.779, 0.967) (P<0.001) for DeepSeek, indicating almost perfect and strong agreement between repeated assignments, respectively, with no significant difference between the models (P=0.125).
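As an illustration of the statistics reported above, the sketch below computes Cohen's kappa (model vs. radiologist agreement) and an exact McNemar test on paired zero-shot vs. few-shot correctness. The per-report labels are hypothetical stand-ins, not the study's data; cohen_kappa_score and mcnemar are existing functions from scikit-learn and statsmodels.

```python
# Illustrative computation of the agreement statistics used in the study,
# on hypothetical data (the study's raw assignments are not reproduced here).
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical reference and model assignments for a handful of reports.
radiologist = ["CAD-RADS 0", "CAD-RADS 3", "CAD-RADS 2", "CAD-RADS 4A"]
model       = ["CAD-RADS 0", "CAD-RADS 3", "CAD-RADS 2", "CAD-RADS 3"]

# Cohen's kappa: chance-corrected agreement between the two raters.
kappa = cohen_kappa_score(radiologist, model)
print(f"Cohen kappa: {kappa:.3f}")

# Paired correct/incorrect outcomes under zero-shot vs few-shot prompting
# (hypothetical; 1 = correct assignment, 0 = incorrect).
zero_shot_correct = np.array([0, 0, 1, 0, 0, 1, 0, 0])
few_shot_correct  = np.array([1, 1, 1, 1, 0, 1, 1, 1])

# 2x2 contingency table of paired outcomes for McNemar's test.
table = np.array([
    [np.sum((zero_shot_correct == 1) & (few_shot_correct == 1)),
     np.sum((zero_shot_correct == 1) & (few_shot_correct == 0))],
    [np.sum((zero_shot_correct == 0) & (few_shot_correct == 1)),
     np.sum((zero_shot_correct == 0) & (few_shot_correct == 0))],
])
result = mcnemar(table, exact=True)
print(f"McNemar P-value: {result.pvalue:.4f}")
```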

Conclusion: Few-shot prompting substantially enhances LLMs' accuracy in assigning CAD-RADS™ 2.0 categories, suggesting potential for clinical application and facilitating system adoption.

Source journal: Journal of Computer Assisted Tomography
CiteScore: 2.50
Self-citation rate: 0.00%
Articles per year: 230
Review turnaround: 4-8 weeks
Journal description: The mission of Journal of Computer Assisted Tomography is to showcase the latest clinical and research developments in CT, MR, and closely related diagnostic techniques. We encourage submission of both original research and review articles that have immediate or promissory clinical applications. Topics of special interest include: 1) functional MR and CT of the brain and body; 2) advanced/innovative MRI techniques (diffusion, perfusion, rapid scanning); and 3) advanced/innovative CT techniques (perfusion, multi-energy, dose-reduction, and processing).