ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings.

IF 2.0 · Q1 · EMERGENCY MEDICINE
Archives of Academic Emergency Medicine · Pub Date: 2025-04-05 · eCollection Date: 2025-01-01 · DOI: 10.22037/aaemj.v13i1.2580
Pooya Hosseini-Monfared, Shayan Amiri, Alireza Mirahmadi, Amirhossein Shahbazi, Aliasghar Alamian, Mohammad Azizi, Seyed Morteza Kazemi
{"title":"chatgpt - 01预览优于ChatGPT-4作为紧急情况下踝关节疼痛分类的诊断支持工具。","authors":"Pooya Hosseini-Monfared, Shayan Amiri, Alireza Mirahmadi, Amirhossein Shahbazi, Aliasghar Alamian, Mohammad Azizi, Seyed Morteza Kazemi","doi":"10.22037/aaemj.v13i1.2580","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and o1-preview in generating differential diagnoses for common cases of ankle pain in emergency settings.</p><p><strong>Methods: </strong>Common presentations of ankle pain were identified through consultations with an experienced orthopedic surgeon and a review of relevant hospital and social media sources. To replicate typical patient inquiries, questions were crafted in simple, non-technical language, requesting three possible differential diagnoses for each scenario. The second phase involved designing case vignettes reflecting scenarios typical for triage nurses or physicians. Responses from ChatGPT were evaluated against a benchmark established by two experienced orthopedic surgeons, with a scoring system assessing the accuracy, clarity, and relevance of the differential diagnoses based on their order.</p><p><strong>Results: </strong>In 21 ankle pain presentations, ChatGPT-o1 preview outperformed ChatGPT-4 in both accuracy and clarity, with only the clarity score reaching statistical significance (p < 0.001). ChatGPT-o1 preview also had a significantly higher total score (p = 0.004). In 15 case vignettes, ChatGPT-o1 preview scored better on diagnostic and management clarity, though differences in diagnostic accuracy were not statistically significant. Among 51 questions, ChatGPT-4 and ChatGPT-o1 preview produced incorrect responses for 5 (9.8%) and 4 (7.8%) questions, respectively. Inter-rater reliability analysis demonstrated excellent reliability of the scoring system with interclass coefficients of 0.99 (95% CI, 0.998-0.999) for accuracy scores and 0.99 (95% CI, 0.990-0.995) for clarity scores.</p><p><strong>Conclusion: </strong>Our findings demonstrated that both ChatGPT-4 and ChatGPT-o1 preview provide acceptable performance in the triage of ankle pain cases in emergency settings. ChatGPT-o1 preview outperformed ChatGPT-4, offering clearer and more precise responses. While both models show potential as supportive tools, their role should remain supervised and strictly supplementary to clinical expertise.</p>","PeriodicalId":8146,"journal":{"name":"Archives of Academic Emergency Medicine","volume":"13 1","pages":"e42"},"PeriodicalIF":2.0000,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12145124/pdf/","citationCount":"0","resultStr":"{\"title\":\"ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings.\",\"authors\":\"Pooya Hosseini-Monfared, Shayan Amiri, Alireza Mirahmadi, Amirhossein Shahbazi, Aliasghar Alamian, Mohammad Azizi, Seyed Morteza Kazemi\",\"doi\":\"10.22037/aaemj.v13i1.2580\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. 
This study aimed to assess the performance of ChatGPT-4 and o1-preview in generating differential diagnoses for common cases of ankle pain in emergency settings.</p><p><strong>Methods: </strong>Common presentations of ankle pain were identified through consultations with an experienced orthopedic surgeon and a review of relevant hospital and social media sources. To replicate typical patient inquiries, questions were crafted in simple, non-technical language, requesting three possible differential diagnoses for each scenario. The second phase involved designing case vignettes reflecting scenarios typical for triage nurses or physicians. Responses from ChatGPT were evaluated against a benchmark established by two experienced orthopedic surgeons, with a scoring system assessing the accuracy, clarity, and relevance of the differential diagnoses based on their order.</p><p><strong>Results: </strong>In 21 ankle pain presentations, ChatGPT-o1 preview outperformed ChatGPT-4 in both accuracy and clarity, with only the clarity score reaching statistical significance (p < 0.001). ChatGPT-o1 preview also had a significantly higher total score (p = 0.004). In 15 case vignettes, ChatGPT-o1 preview scored better on diagnostic and management clarity, though differences in diagnostic accuracy were not statistically significant. Among 51 questions, ChatGPT-4 and ChatGPT-o1 preview produced incorrect responses for 5 (9.8%) and 4 (7.8%) questions, respectively. Inter-rater reliability analysis demonstrated excellent reliability of the scoring system with interclass coefficients of 0.99 (95% CI, 0.998-0.999) for accuracy scores and 0.99 (95% CI, 0.990-0.995) for clarity scores.</p><p><strong>Conclusion: </strong>Our findings demonstrated that both ChatGPT-4 and ChatGPT-o1 preview provide acceptable performance in the triage of ankle pain cases in emergency settings. ChatGPT-o1 preview outperformed ChatGPT-4, offering clearer and more precise responses. While both models show potential as supportive tools, their role should remain supervised and strictly supplementary to clinical expertise.</p>\",\"PeriodicalId\":8146,\"journal\":{\"name\":\"Archives of Academic Emergency Medicine\",\"volume\":\"13 1\",\"pages\":\"e42\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-04-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12145124/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Archives of Academic Emergency Medicine\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.22037/aaemj.v13i1.2580\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"EMERGENCY MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Archives of Academic Emergency Medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.22037/aaemj.v13i1.2580","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"EMERGENCY MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract


Introduction: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and ChatGPT-o1 preview in generating differential diagnoses for common cases of ankle pain in emergency settings.

Methods: Common presentations of ankle pain were identified through consultations with an experienced orthopedic surgeon and a review of relevant hospital and social media sources. To replicate typical patient inquiries, questions were crafted in simple, non-technical language, requesting three possible differential diagnoses for each scenario. The second phase involved designing case vignettes reflecting scenarios typical for triage nurses or physicians. Responses from ChatGPT were evaluated against a benchmark established by two experienced orthopedic surgeons, with a scoring system assessing the accuracy, clarity, and relevance of the differential diagnoses based on the order in which they were listed.
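
The abstract does not describe how the scenarios were submitted to the two models. Purely as an illustration, the sketch below shows how a single patient-style question might be posed to both ChatGPT-4 and ChatGPT-o1 preview through the OpenAI Python SDK; the example scenario, prompt wording, model identifiers, and use of this SDK are assumptions for illustration, not details taken from the study.

```python
# Minimal sketch: pose one patient-style ankle-pain question to two models
# and print each reply. Model names and the prompt are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCENARIO = (
    "I twisted my ankle stepping off a curb yesterday. It is swollen on the "
    "outside and hurts when I put weight on it. What could be wrong?"
)
PROMPT = SCENARIO + " Please list the three most likely diagnoses, most likely first."

def ask(model: str, prompt: str) -> str:
    """Send one user message and return the model's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for model in ("gpt-4", "o1-preview"):  # hypothetical API model identifiers
    print(f"--- {model} ---")
    print(ask(model, PROMPT))
```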

Results: In 21 ankle pain presentations, ChatGPT-o1 preview outperformed ChatGPT-4 in both accuracy and clarity, with only the clarity score reaching statistical significance (p < 0.001). ChatGPT-o1 preview also had a significantly higher total score (p = 0.004). In 15 case vignettes, ChatGPT-o1 preview scored better on diagnostic and management clarity, though differences in diagnostic accuracy were not statistically significant. Among 51 questions, ChatGPT-4 and ChatGPT-o1 preview produced incorrect responses for 5 (9.8%) and 4 (7.8%) questions, respectively. Inter-rater reliability analysis demonstrated excellent reliability of the scoring system, with intraclass correlation coefficients of 0.99 (95% CI, 0.998-0.999) for accuracy scores and 0.99 (95% CI, 0.990-0.995) for clarity scores.
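
The intraclass correlation coefficients above quantify how closely the two surgeons' scores agreed. As a rough illustration of how such an ICC can be computed for two raters, the sketch below uses the pingouin package on made-up scores; neither the package choice nor the data come from the study.

```python
# Illustrative two-rater ICC computation; the scores are invented, not the
# study's data, and pingouin is an assumed tooling choice.
import pandas as pd
import pingouin as pg

scores = pd.DataFrame({
    "question": [1, 1, 2, 2, 3, 3, 4, 4],   # each question scored by both raters
    "rater":    ["A", "B"] * 4,
    "accuracy": [5, 5, 4, 4, 3, 4, 5, 5],
})

icc = pg.intraclass_corr(
    data=scores, targets="question", raters="rater", ratings="accuracy"
)
# Print the ICC estimates and their 95% confidence intervals for each ICC type.
print(icc[["Type", "ICC", "CI95%"]])
```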

Conclusion: Our findings demonstrated that both ChatGPT-4 and ChatGPT-o1 preview provide acceptable performance in the triage of ankle pain cases in emergency settings. ChatGPT-o1 preview outperformed ChatGPT-4, offering clearer and more precise responses. While both models show potential as supportive tools, their role should remain supervised and strictly supplementary to clinical expertise.

Source journal: Archives of Academic Emergency Medicine (Medicine - Emergency Medicine)
CiteScore: 8.90 · Self-citation rate: 7.40% · Review time: 6 weeks