Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience

Marina Alvarez-Estape, Ivan Cano, Rosa Pino, Carla González Grado, Andrea Aldemira-Liz, Javier Gonzálvez-Ortuño, Juanjo do Olmo, Javier Logroño, Marcelo Martínez, Carlos Mascías, Julián Isla, Jordi Martínez Roldán, Cristian Launes, Francesc Garcia-Cuyas, Paula Esteller-Cucala
{"title":"Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience","authors":"Marina Alvarez-Estape, Ivan Cano, Rosa Pino, Carla González Grado, Andrea Aldemira-Liz, Javier Gonzálvez-Ortuño, Juanjo do Olmo, Javier Logroño, Marcelo Martínez, Carlos Mascías, Julián Isla, Jordi Martínez Roldán, Cristian Launes, Francesc Garcia-Cuyas, Paula Esteller-Cucala","doi":"10.1101/2024.07.23.24310847","DOIUrl":null,"url":null,"abstract":"<strong>Importance</strong> The time to accurately diagnose rare pediatric diseases often spans years. Assessing the diagnostic accuracy of an LLM-based tool on real pediatric cases can help reduce this time, providing quicker diagnoses for patients and their families. <strong>Objective</strong> To evaluate the clinical utility of DxGPT as a support tool for differential diagnosis of both common and rare diseases. <strong>Design</strong> Unicentric descriptive cross-sectional exploratory study. Anonymized data from 50 pediatric patients' medical histories, covering common and rare pathologies, were used to generate clinical case notes. Each clinical case included essential data, with some expanded by complementary tests. <strong>Setting</strong> This study was conducted at a reference pediatric hospital, Sant Joan de Déu Barcelona Children′s Hospital. <strong>Participants</strong> A total of 50 clinical cases were diagnosed by 78 volunteer doctors (medical diagnostic team) with varying experience, each reviewing 3 clinical cases. <strong>Interventions</strong> Each clinician listed up to five diagnoses per clinical case note. The same was done on the DxGPT web platform, obtaining the Top-5 diagnostic proposals. To evaluate DxGPT's variability, each note was queried three times. <strong>Main Outcome(s) and Measure(s)</strong> The study mainly focused on comparing diagnostic accuracy, defined as the percentage of cases with the correct diagnosis, between the medical diagnostic team and DxGPT. Other evaluation criteria included qualitative assessments. The medical diagnostic team also completed a survey on their user experience with DxGPT.\n<strong>Results</strong> Top-5 diagnostic accuracy was 65% for clinicians and 60% for DxGPT, with no significant differences. Accuracies for common diseases were higher (Clinicians: 79%, DxGPT: 71%) than for rare diseases (Clinicians: 50%, DxGPT: 49%). Accuracy increased similarly in both groups with expanded information, but this increase was only stastically significant in clinicians (simple 52% vs. expanded 69%; p=0.03). DxGPT′s response variability affected less than 5% of clinical case notes. A survey of 48 clinicians rated the DxGPT platform 3.9/5 overall, 4.1/5 for usefulness, and 4.5/5 for usability. <strong>Conclusions and Relevance</strong> DxGPT showed diagnostic accuracies similar to medical staff from a pediatric hospital, indicating its potential for supporting differential diagnosis in other settings. Clinicians praised its usability and simplicity. 
These tools could provide new insights for challenging diagnostic cases.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.07.23.24310847","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Importance: The time to accurately diagnose rare pediatric diseases often spans years. Assessing the diagnostic accuracy of an LLM-based tool on real pediatric cases can help reduce this time, providing quicker diagnoses for patients and their families.

Objective: To evaluate the clinical utility of DxGPT as a support tool for the differential diagnosis of both common and rare diseases.

Design: Unicentric, descriptive, cross-sectional, exploratory study. Anonymized data from the medical histories of 50 pediatric patients, covering common and rare pathologies, were used to generate clinical case notes. Each clinical case note included essential data, and some were expanded with complementary test results.

Setting: This study was conducted at a reference pediatric hospital, Sant Joan de Déu Barcelona Children's Hospital.

Participants: A total of 50 clinical cases were diagnosed by 78 volunteer doctors (the medical diagnostic team) with varying levels of experience, each reviewing 3 clinical cases.

Interventions: Each clinician listed up to five diagnoses per clinical case note. The same was done on the DxGPT web platform, obtaining its Top-5 diagnostic proposals. To evaluate DxGPT's variability, each note was queried three times.

Main Outcome(s) and Measure(s): The study focused mainly on comparing diagnostic accuracy, defined as the percentage of cases in which the correct diagnosis appeared among the proposals, between the medical diagnostic team and DxGPT. Other evaluation criteria included qualitative assessments. The medical diagnostic team also completed a survey on their user experience with DxGPT.

Results: Top-5 diagnostic accuracy was 65% for clinicians and 60% for DxGPT, with no significant difference. Accuracies for common diseases were higher (clinicians: 79%, DxGPT: 71%) than for rare diseases (clinicians: 50%, DxGPT: 49%). Accuracy increased similarly in both groups when notes were expanded with complementary information, but the increase was statistically significant only for clinicians (simple 52% vs. expanded 69%; p=0.03). DxGPT's response variability affected fewer than 5% of clinical case notes. A survey of 48 clinicians rated the DxGPT platform 3.9/5 overall, 4.1/5 for usefulness, and 4.5/5 for usability.

Conclusions and Relevance: DxGPT showed diagnostic accuracy similar to that of medical staff at a pediatric hospital, indicating its potential for supporting differential diagnosis in other settings. Clinicians praised its usability and simplicity. These tools could provide new insights for challenging diagnostic cases.
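The Top-5 accuracy metric reported above is simple to state precisely. Below is a minimal Python sketch, assuming exact string matching between proposed and reference diagnoses; in the study itself, matches were presumably judged by clinical review rather than string comparison, and the diagnosis names here are hypothetical, not study data.

```python
def top5_accuracy(cases):
    """Top-5 diagnostic accuracy: the fraction of cases whose correct
    diagnosis appears among the (up to) five proposed diagnoses.

    cases: list of (proposals, correct) pairs, where proposals is an
    ordered list of up to five candidate diagnoses.
    """
    hits = sum(correct in proposals[:5] for proposals, correct in cases)
    return hits / len(cases)

# Toy example with hypothetical diagnoses (not study data):
cases = [
    (["Kawasaki disease", "scarlet fever", "measles"], "Kawasaki disease"),  # hit
    (["asthma", "bronchiolitis", "pneumonia"], "cystic fibrosis"),           # miss
]
print(f"Top-5 accuracy: {top5_accuracy(cases):.0%}")  # -> Top-5 accuracy: 50%
```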
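The Interventions paragraph describes querying DxGPT three times per case note to gauge response variability. The abstract does not describe DxGPT's internals, so the sketch below only illustrates the general pattern against an OpenAI-style chat API: the prompt wording, model name, and output parsing are assumptions for illustration, not the study's actual pipeline.

```python
# A minimal sketch of the repeated-query variability check. DxGPT is
# GPT-4 based, but its prompt and backend are not given in the abstract;
# the prompt, model name, and parsing below are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Given the clinical case note below, list the five most likely "
          "diagnoses, one per line, most likely first.\n\n{note}")

def top5_proposals(note: str) -> list[str]:
    """One query: return up to five diagnoses proposed for a case note."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
    )
    lines = resp.choices[0].message.content.splitlines()
    # Strip list numbering/bullets and drop empty lines.
    return [ln.strip(" -0123456789.") for ln in lines if ln.strip()][:5]

def variability(note: str, runs: int = 3) -> int:
    """Query the same note several times (three in the study) and count
    how many distinct Top-5 lists come back; 1 means fully stable."""
    return len({tuple(top5_proposals(note)) for _ in range(runs)})
```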