Exploring the Potential of ChatGPT-4 for Clinical Decision Support in Cardiac Electrophysiology and Its Semi-Automatic Evaluation Metrics

Xiarepati Tieliwaerdi, Abulikemu Abuduweili, Saleh Saleh, Erasmus Mutabi, Michael A Rosenberg, Emerson Liu
{"title":"Exploring the Potential of ChatGPT-4 for Clinical Decision Support in Cardiac Electrophysiology and Its Semi-Automatic Evaluation Metrics","authors":"Xiarepati Tieliwaerdi, Abulikemu Abuduweili, Saleh Saleh, Erasmus Mutabi, Michael A Rosenberg, Emerson Liu","doi":"10.1101/2024.07.10.24310247","DOIUrl":null,"url":null,"abstract":"Background/Aim: Despite extensive research in other medical fields, the capabilities of ChatGPT-4 in clinical decision support within cardiac electrophysiology (EP) remain largely unexplored. This study aims to enhance ChatGPT- 4`s domain-specific expertise by employing the Retrieval-Augmented Generation (RAG) approach, which integrates up-to-date, evidence-based knowledge into ChatGPT-4`s foundational database. Additionally, we plan to explore the use of commonly used automatic evaluation metrics in natural language processing, such as BERTScore, BLEURT, and cosine similarity, alongside human evaluation, to develop a semi-automatic framework. This aims to reduce dependency on exhaustive human evaluations, addressing the need for efficient and scalable assessment tools in medical decision-making, given the rapid adoption of ChatGPT-4 by the public. Method: We analyzed five atrial fibrillation (Afib) cases and seven cardiac implantable electronic device (CIED) infection cases curated from PubMed case reports. We conducted a total of 120 experiments for Afib and 168 for CIED cases, testing each case across four temperature settings (0, 0.5, 1, 1.2) and three seed settings (1, 2, 3). ChatGPT-4`s performance was assessed under two modes: the Retrieval-Augmented Generation (RAG) mode and the Cold Turkey mode, which queries ChatGPT without external knowledge via RAG. For Afib cases, ChatGPT was asked to determine rate, rhythm, and anticoagulation options, and provide reasoning for each. For CIED cases, ChatGPT is asked to determine the presence of device infections. Accuracy metrics evaluated the determination component, while reasoning was assessed by human evaluation, BERTScore, BLEURT, and cosine similarity. A mixed effects analysis was used to compare the performance under both models across varying seeds and temperatures. Spearman`s rank correlation was used to explore the relationship between automatic metrics and human evaluation. Results: In this study, 120 experiments for Afib and 168 for CIED were conducted. There is no significant difference between the RAG mode and the Cold Turkey mode across various metrics including determination accuracy, reasoning similarity, and human evaluation scores, although RAG achieved higher cosine similarity scores in Afib cases (0.82 vs. 0.75) and better accuracy in CIED cases (0.70 vs. 0.66), though these differences were not statistically significant due to the small sample size. Our mixed effects analysis revealed no significant effects of temperature or method interactions, indicating stable performance across these variables. Moreover, while no individual evaluation metric, such as BERTScore, BLEURT or cosine similarity, showed a high correlation with human evaluations. However, the ACC-Sim metric, which averages accuracy and cosine similarity, exhibits the highest correlation with human evaluation, with Spearman`s ρ at 0.86 and a P value < 0.001, indicating a significant ordinal correlation between ACC-Sim and human evaluation. 
This suggests its potential as a surrogate for human evaluation in similar medical scenarios.\nConclusion: Our study did not find a significant difference between the RAG and Cold Turkey methods in terms of ChatGPT-4`s clinical decision-making performance in Afib and CIED infection management. The ACC-Sim metric closely aligns with human evaluations in these specific medical contexts and shows promise for integration into a semi-automatic evaluation framework.","PeriodicalId":501297,"journal":{"name":"medRxiv - Cardiovascular Medicine","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Cardiovascular Medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.07.10.24310247","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background/Aim: Despite extensive research in other medical fields, the capabilities of ChatGPT-4 in clinical decision support within cardiac electrophysiology (EP) remain largely unexplored. This study aims to enhance ChatGPT-4's domain-specific expertise through Retrieval-Augmented Generation (RAG), which supplements ChatGPT-4's foundational knowledge with up-to-date, evidence-based sources. We also explore whether automatic evaluation metrics common in natural language processing, such as BERTScore, BLEURT, and cosine similarity, can be combined with human evaluation into a semi-automatic framework. The goal is to reduce dependency on exhaustive human evaluation and to address the need for efficient, scalable assessment tools in medical decision-making, given the rapid public adoption of ChatGPT-4.

Method: We analyzed five atrial fibrillation (Afib) cases and seven cardiac implantable electronic device (CIED) infection cases curated from PubMed case reports, conducting a total of 120 experiments for Afib and 168 for CIED. Each case was tested across four temperature settings (0, 0.5, 1, 1.2) and three seed settings (1, 2, 3), under two modes: the RAG mode and the Cold Turkey mode, which queries ChatGPT-4 directly without external knowledge retrieved via RAG. For Afib cases, ChatGPT-4 was asked to determine rate, rhythm, and anticoagulation options and to provide reasoning for each; for CIED cases, it was asked to determine the presence of device infection. Accuracy metrics evaluated the determination component, while reasoning was assessed by human evaluation, BERTScore, BLEURT, and cosine similarity. A mixed-effects analysis compared performance under both modes across varying seeds and temperatures, and Spearman's rank correlation was used to explore the relationship between the automatic metrics and human evaluation.

Results: There was no significant difference between the RAG mode and the Cold Turkey mode across the metrics examined, including determination accuracy, reasoning similarity, and human evaluation scores. RAG achieved higher cosine similarity scores in Afib cases (0.82 vs. 0.75) and better accuracy in CIED cases (0.70 vs. 0.66), but these differences were not statistically significant, likely owing to the small sample size. The mixed-effects analysis revealed no significant effects of temperature or method interactions, indicating stable performance across these variables. No individual evaluation metric (BERTScore, BLEURT, or cosine similarity) showed a high correlation with human evaluation; however, the ACC-Sim metric, which averages determination accuracy and cosine similarity, exhibited the strongest correlation (Spearman's ρ = 0.86, P < 0.001), indicating a significant ordinal relationship between ACC-Sim and human evaluation and suggesting its potential as a surrogate for human evaluation in similar medical scenarios.

Conclusion: Our study did not find a significant difference between the RAG and Cold Turkey methods in ChatGPT-4's clinical decision-making performance for Afib and CIED infection management. The ACC-Sim metric closely aligns with human evaluation in these specific medical contexts and shows promise for integration into a semi-automatic evaluation framework.
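The abstract does not include the query code, so the following is a minimal sketch of the experiment grid it describes, assuming the authors queried ChatGPT-4 through the OpenAI chat-completions API. The model identifier, prompt wording, and helper names (`query_afib_case`, `afib_cases`) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the Afib experiment grid (assumed OpenAI chat-completions
# API; model name, prompt, and helper names are illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEMPERATURES = [0, 0.5, 1, 1.2]  # the four temperature settings
SEEDS = [1, 2, 3]                # the three seed settings

afib_cases: list[str] = []  # populate with the five curated case-report texts

def query_afib_case(case_text: str, temperature: float, seed: int) -> str:
    """One experiment: ask for rate, rhythm, and anticoagulation options,
    with reasoning, for a single Afib case."""
    response = client.chat.completions.create(
        model="gpt-4",            # assumed model identifier
        temperature=temperature,
        seed=seed,                # best-effort reproducibility knob in the API
        messages=[
            {"role": "user", "content": (
                "For the following atrial fibrillation case, determine the "
                "rate, rhythm, and anticoagulation options, and provide "
                "reasoning for each.\n\n" + case_text
            )},
        ],
    )
    return response.choices[0].message.content

# 5 Afib cases x 4 temperatures x 3 seeds = 60 runs per mode; repeating the
# grid under the RAG and Cold Turkey modes yields the 120 Afib experiments.
results = {
    (i, t, s): query_afib_case(case, t, s)
    for i, case in enumerate(afib_cases)
    for t in TEMPERATURES
    for s in SEEDS
}
```

In the RAG mode, retrieved guideline passages would be prepended to the user prompt before querying; in the Cold Turkey mode, the case is sent as-is.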
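The mixed-effects comparison of the two modes across temperatures and seeds could be run as below. The paper does not publish its exact model specification, so this statsmodels formula, the random-intercept structure, and the toy data are all illustrative assumptions.

```python
# Sketch of a linear mixed model comparing modes across temperatures,
# with a random intercept per case (assumed specification; toy data only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Toy data mirroring the Afib design: 5 cases x 4 temperatures x 3 seeds x 2 modes.
rows = [
    {"case_id": c, "temperature": t, "seed": s, "method": m,
     "score": 0.75 + rng.normal(0, 0.05)}  # placeholder scores, not real results
    for c in range(5)
    for t in (0, 0.5, 1, 1.2)
    for s in (1, 2, 3)
    for m in ("RAG", "ColdTurkey")
]
df = pd.DataFrame(rows)

# Fixed effects for method, temperature, and their interaction;
# the random intercept per case absorbs case-level difficulty.
fit = smf.mixedlm("score ~ method * temperature", df, groups=df["case_id"]).fit()
print(fit.summary())  # inspect the method and method:temperature terms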
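The abstract defines ACC-Sim only as the average of determination accuracy and cosine similarity. Below is a minimal sketch of that computation and its validation against human scores via Spearman's rank correlation; the embedding model ("all-MiniLM-L6-v2") and the toy inputs are illustrative assumptions, as the paper does not specify how the reasoning text was embedded.

```python
# Sketch of ACC-Sim — (determination accuracy + reasoning cosine similarity) / 2 —
# and its correlation with human evaluation. Embedding model and toy data
# are assumptions, not the authors' implementation.
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def acc_sim(accuracy: float, model_reasoning: str, reference_reasoning: str) -> float:
    """ACC-Sim = (determination accuracy + cosine similarity of reasoning) / 2."""
    emb = encoder.encode([model_reasoning, reference_reasoning])
    cosine = float(util.cos_sim(emb[0], emb[1]))
    return (accuracy + cosine) / 2

# Toy per-experiment inputs (real inputs would come from the 120/168 runs).
accuracies   = [1.0, 0.0, 1.0]
outputs      = ["Rate control with a beta blocker ...",
                "No anticoagulation needed ...",
                "Rhythm control via catheter ablation ..."]
references   = ["Beta blockade is first-line for rate control ...",
                "Anticoagulation is indicated given the CHA2DS2-VASc score ...",
                "Catheter ablation is a reasonable rhythm-control strategy ..."]
human_scores = [0.9, 0.2, 0.8]  # illustrative human-evaluation ratings

acc_sim_scores = [acc_sim(a, o, r) for a, o, r in zip(accuracies, outputs, references)]
rho, p = spearmanr(acc_sim_scores, human_scores)
print(f"Spearman rho = {rho:.2f}, P = {p:.3g}")  # the paper reports rho = 0.86, P < 0.001
```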