Superhuman performance on urology board questions using an explainable language model enhanced with European Association of Urology guidelines

M.J. Hetz, N. Carl, S. Haggenmüller, C. Wies, J.N. Kather, M.S. Michel, F. Wessels, T.J. Brinker

Abstract


Background

Large language models encode clinical knowledge and can answer medical expert questions out of the box, without further training. However, this zero-shot performance is limited by outdated training data and a lack of explainability, both of which impede clinical translation. We aimed to develop a urology-specialized chatbot (UroBot) and evaluate it, in a fully clinician-verifiable manner, against state-of-the-art models and the historically reported performance of urologists on urological board questions.

Materials and methods

We developed UroBot, a software pipeline based on OpenAI's GPT-3.5, GPT-4, and GPT-4o models, using retrieval-augmented generation (RAG) over the 2023 European Association of Urology guidelines. UroBot was benchmarked against the zero-shot performance of GPT-3.5, GPT-4, GPT-4o, and Uro_Chat. The evaluation comprised 10 runs over 200 European Board of Urology in-service assessment questions, with performance measured as the mean rate of correct answers (RoCA).
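The abstract does not reproduce the pipeline itself, but the description maps onto a standard retrieval-augmented generation loop: embed guideline passages, retrieve the passages most similar to the question, and have the model answer grounded in (and citing) the retrieved text. Below is a minimal sketch using the OpenAI Python SDK; the chunking scheme, embedding model, top-k value, guideline file name, and prompt wording are illustrative assumptions, not the authors' configuration.

```python
# Minimal RAG sketch in the spirit of UroBot. Assumptions: chunk size,
# embedding model, top-k, file name, and prompt wording are illustrative
# and not taken from the paper.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk(text: str, size: int = 800) -> list[str]:
    """Split guideline text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with an OpenAI embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Index the guideline corpus (hypothetical plain-text export of the 2023 EAU guidelines).
guideline_text = open("eau_guidelines_2023.txt").read()
chunks = chunk(guideline_text)
# Embed in batches to stay within API request limits.
index = np.vstack([embed(chunks[i:i + 256]) for i in range(0, len(chunks), 256)])
index /= np.linalg.norm(index, axis=1, keepdims=True)  # normalize for cosine similarity

def answer(question: str, k: int = 5) -> str:
    """Retrieve the k most similar guideline chunks and answer with them as context."""
    q = embed([question])[0]
    q /= np.linalg.norm(q)
    top = np.argsort(index @ q)[-k:][::-1]  # top-k chunks by cosine similarity
    context = "\n\n".join(chunks[i] for i in top)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer the urology board question using only the "
                        "guideline excerpts below, and quote the excerpt you "
                        "relied on.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

Returning the quoted guideline excerpts alongside the answer is what would make each response clinician-verifiable: a reader can check the cited passage rather than trust the model.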

Results

UroBot-4o achieved the highest RoCA, averaging 88.4% and outperforming GPT-4o (77.6%) by 10.8 percentage points. In addition, its answers are clinician-verifiable, and it showed the highest agreement between runs as measured by Fleiss' kappa (κ = 0.979). By comparison, the average performance of urologists on urological board questions reported in the literature is 68.7%.
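Both reported metrics can be reproduced from a questions-by-runs matrix of model answers. The sketch below uses randomly generated stand-in data (not the study's results): it averages per-run accuracy into a mean RoCA, and treats each run as a rater of each question to compute Fleiss' kappa with statsmodels.

```python
# Sketch: mean RoCA and between-run agreement (Fleiss' kappa) from a
# (questions x runs) answer matrix. Random data stands in for real output.
import numpy as np
from statsmodels.stats import inter_rater as irr

rng = np.random.default_rng(0)
n_questions, n_runs = 200, 10

# answers[i, j]: option (0-3) chosen for question i on run j.
answers = rng.integers(0, 4, size=(n_questions, n_runs))
correct = rng.integers(0, 4, size=n_questions)  # answer key

# Mean rate of correct answers (RoCA), averaged over the 10 runs.
roca = (answers == correct[:, None]).mean(axis=0)  # per-run accuracy
print(f"mean RoCA: {roca.mean():.1%}")

# Fleiss' kappa: each run acts as a rater; aggregate to per-category counts.
table, _ = irr.aggregate_raters(answers)
print(f"Fleiss' kappa: {irr.fleiss_kappa(table):.3f}")
```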

Conclusions

UroBot is an accurate, clinician-verifiable software pipeline that outperforms published models and urologists in answering urology board questions. We provide code and instructions for using and extending UroBot in further development.