Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions.

IF 5.8 · CAS Tier 2 (Medicine) · Q1 HEALTH CARE SCIENCES & SERVICES
Raphaël Bentegeac, Bastien Le Guellec, Grégory Kuchcinski, Philippe Amouyel, Aghiles Hamroun
{"title":"令牌概率减轻大型语言模型在回答医学问题时的过度自信。","authors":"Raphaël Bentegeac, Bastien Le Guellec, Grégory Kuchcinski, Philippe Amouyel, Aghiles Hamroun","doi":"10.2196/64348","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Chatbots have demonstrated promising capabilities in medicine, scoring passing grades for board examinations across various specialties. However, their tendency to express high levels of confidence in their responses, even when incorrect, poses a limitation to their utility in clinical settings.</p><p><strong>Objective: </strong>To examine whether token probabilities outperform chatbots' expressed confidence levels in predicting the accuracy of their responses to medical questions.</p><p><strong>Methods: </strong>Nine large language models (LLMs), comprising both commercial (GPT-3.5, GPT-4 and GPT-4o) and open-source (Llama 3.1-8b, Llama 3.1-70b, Phi-3-Mini, Phi-3-Medium, Gemma 2-9b and Gemma 2-27b), were prompted to respond to a set of 2,522 questions from the US Medical Licensing Examination (MedQA database). Additionally, the models rated their confidence from 0 to 100 and the token probability of each response was extracted. The models' success rates were measured, and the predictive performances of both expressed confidence and response token probability in predicting response accuracy were evaluated using Area Under the Receiver Operating Characteristic Curves (AUROCs), Adapted Calibration Error (ACE) and Brier score. Sensitivity analyses were conducted using additional questions sourced from other databases in English (MedMCQA, n=2,797), Chinese (MedQA Main-land China, n=3,413 and Taiwan, n=2,808), and French (FrMedMCQA, n=1,079), different prompting strategies and temperature settings.</p><p><strong>Results: </strong>Overall, mean accuracy ranged from 56.5% [54.6 - 58.5] for Phi-3-Mini to 89.0% [87.7-90.2] for GPT-4o. Across the US Medical Licensing Examination questions, all chatbots consistently expressed high levels of confidence in their responses (ranging from 90[90-90] for Llama 3.1-70B to 100[100-100] for GPT-3.5). However, expressed confidence failed to predict response accuracy (AUROC ranging from 0.52[0.50-0.53] for Phi 3 Mini to 0.68[0.65-0.71] for GPT-4o). In contrast, the response token probability consistently outperformed expressed confidence for predicting response accuracy (AUROCs ranging from 0.71 [0.69 - 0.73] for Phi 3 mini to 0.87 [0.85 - 0.89] for GPT-4o, all p-values<0.001). Furthermore, all models demonstrated imperfect calibration, with a general trend towards overconfidence. These findings were consistent in sensitivity analyses.</p><p><strong>Conclusions: </strong>Due to the limited capacity of chatbots to accurately evaluate their confidence when responding to medical queries, clinicians and patients should abstain from relying on their self-rated certainty. 
Instead, token probabilities emerge as a promising and easily accessible alternative for gauging the inner doubts of these models.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":" ","pages":""},"PeriodicalIF":5.8000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions.\",\"authors\":\"Raphaël Bentegeac, Bastien Le Guellec, Grégory Kuchcinski, Philippe Amouyel, Aghiles Hamroun\",\"doi\":\"10.2196/64348\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Chatbots have demonstrated promising capabilities in medicine, scoring passing grades for board examinations across various specialties. However, their tendency to express high levels of confidence in their responses, even when incorrect, poses a limitation to their utility in clinical settings.</p><p><strong>Objective: </strong>To examine whether token probabilities outperform chatbots' expressed confidence levels in predicting the accuracy of their responses to medical questions.</p><p><strong>Methods: </strong>Nine large language models (LLMs), comprising both commercial (GPT-3.5, GPT-4 and GPT-4o) and open-source (Llama 3.1-8b, Llama 3.1-70b, Phi-3-Mini, Phi-3-Medium, Gemma 2-9b and Gemma 2-27b), were prompted to respond to a set of 2,522 questions from the US Medical Licensing Examination (MedQA database). Additionally, the models rated their confidence from 0 to 100 and the token probability of each response was extracted. The models' success rates were measured, and the predictive performances of both expressed confidence and response token probability in predicting response accuracy were evaluated using Area Under the Receiver Operating Characteristic Curves (AUROCs), Adapted Calibration Error (ACE) and Brier score. Sensitivity analyses were conducted using additional questions sourced from other databases in English (MedMCQA, n=2,797), Chinese (MedQA Main-land China, n=3,413 and Taiwan, n=2,808), and French (FrMedMCQA, n=1,079), different prompting strategies and temperature settings.</p><p><strong>Results: </strong>Overall, mean accuracy ranged from 56.5% [54.6 - 58.5] for Phi-3-Mini to 89.0% [87.7-90.2] for GPT-4o. Across the US Medical Licensing Examination questions, all chatbots consistently expressed high levels of confidence in their responses (ranging from 90[90-90] for Llama 3.1-70B to 100[100-100] for GPT-3.5). However, expressed confidence failed to predict response accuracy (AUROC ranging from 0.52[0.50-0.53] for Phi 3 Mini to 0.68[0.65-0.71] for GPT-4o). In contrast, the response token probability consistently outperformed expressed confidence for predicting response accuracy (AUROCs ranging from 0.71 [0.69 - 0.73] for Phi 3 mini to 0.87 [0.85 - 0.89] for GPT-4o, all p-values<0.001). Furthermore, all models demonstrated imperfect calibration, with a general trend towards overconfidence. These findings were consistent in sensitivity analyses.</p><p><strong>Conclusions: </strong>Due to the limited capacity of chatbots to accurately evaluate their confidence when responding to medical queries, clinicians and patients should abstain from relying on their self-rated certainty. 
Instead, token probabilities emerge as a promising and easily accessible alternative for gauging the inner doubts of these models.</p><p><strong>Clinicaltrial: </strong></p>\",\"PeriodicalId\":16337,\"journal\":{\"name\":\"Journal of Medical Internet Research\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":5.8000,\"publicationDate\":\"2025-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Medical Internet Research\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2196/64348\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Internet Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/64348","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract


Background: Chatbots have demonstrated promising capabilities in medicine, scoring passing grades on board examinations across various specialties. However, their tendency to express high confidence in their responses, even when incorrect, limits their utility in clinical settings.

Objective: To examine whether token probabilities outperform chatbots' expressed confidence levels in predicting the accuracy of their responses to medical questions.

Methods: Nine large language models (LLMs), comprising both commercial (GPT-3.5, GPT-4, and GPT-4o) and open-source (Llama 3.1-8b, Llama 3.1-70b, Phi-3-Mini, Phi-3-Medium, Gemma 2-9b, and Gemma 2-27b) models, were prompted to respond to a set of 2,522 questions from the US Medical Licensing Examination (MedQA database). Additionally, the models rated their confidence from 0 to 100, and the token probability of each response was extracted. The models' success rates were measured, and the performance of both expressed confidence and response token probability in predicting response accuracy was evaluated using the area under the receiver operating characteristic curve (AUROC), adapted calibration error (ACE), and the Brier score. Sensitivity analyses were conducted using additional questions sourced from other databases in English (MedMCQA, n=2,797), Chinese (MedQA Mainland China, n=3,413, and Taiwan, n=2,808), and French (FrMedMCQA, n=1,079), as well as different prompting strategies and temperature settings.
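The response token probability referred to here is exposed directly by most LLM APIs. As a minimal illustration (a sketch, not the authors' exact pipeline), the Python snippet below queries the OpenAI chat completions API with logprobs enabled and converts the log probability of the single answer-letter token into a probability; the prompt wording and model name are assumptions.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_with_token_probability(question: str, choices: str) -> tuple[str, float]:
    """Ask for a single answer letter and return it with its token probability."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the study also used GPT-3.5 and GPT-4
        messages=[
            {"role": "system", "content": "Answer with a single letter (A-E) only."},
            {"role": "user", "content": f"{question}\n{choices}"},
        ],
        max_tokens=1,     # the answer is a single token (one letter)
        logprobs=True,    # ask the API to return log probabilities
    )
    token_info = response.choices[0].logprobs.content[0]
    # Convert the log probability of the generated answer token to a probability.
    return token_info.token, math.exp(token_info.logprob)


# Hypothetical usage with a truncated MCQ stem:
answer, prob = answer_with_token_probability(
    "A 45-year-old man presents with ... Which is the most likely diagnosis?",
    "A) ... B) ... C) ... D) ... E) ...",
)
print(f"Model answered {answer} with token probability {prob:.3f}")
```

For open-weight models such as Llama or Gemma, the same quantity can be read from the per-token logits of any local inference stack; the API route above is simply the most accessible option.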

Results: Overall, mean accuracy ranged from 56.5% [54.6-58.5] for Phi-3-Mini to 89.0% [87.7-90.2] for GPT-4o. Across the US Medical Licensing Examination questions, all chatbots consistently expressed high levels of confidence in their responses (ranging from 90 [90-90] for Llama 3.1-70b to 100 [100-100] for GPT-3.5). However, expressed confidence failed to predict response accuracy (AUROC ranging from 0.52 [0.50-0.53] for Phi-3-Mini to 0.68 [0.65-0.71] for GPT-4o). In contrast, the response token probability consistently outperformed expressed confidence in predicting response accuracy (AUROCs ranging from 0.71 [0.69-0.73] for Phi-3-Mini to 0.87 [0.85-0.89] for GPT-4o; all p<0.001). Furthermore, all models demonstrated imperfect calibration, with a general trend toward overconfidence. These findings were consistent across sensitivity analyses.
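For context, the discrimination and calibration metrics above can be computed as in the sketch below. AUROC and the Brier score come from scikit-learn; the ACE implementation shown is a common equal-frequency-binning variant and is an assumption on my part, since the abstract does not specify the binning scheme. The toy data are synthetic, for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss


def adaptive_calibration_error(confidences, correct, n_bins=10):
    """Mean |accuracy - confidence| over equal-frequency bins (assumed ACE variant)."""
    order = np.argsort(confidences)
    bins = np.array_split(order, n_bins)  # roughly equal-sized bins by confidence rank
    gaps = [
        abs(correct[idx].mean() - confidences[idx].mean())
        for idx in bins
        if len(idx) > 0
    ]
    return float(np.mean(gaps))


# Toy data: per-question token probability and whether the answer was correct.
rng = np.random.default_rng(0)
token_prob = rng.uniform(0.3, 1.0, size=500)
correct = (rng.uniform(size=500) < token_prob).astype(int)  # well-calibrated toy model

print("AUROC:", roc_auc_score(correct, token_prob))      # discrimination
print("Brier:", brier_score_loss(correct, token_prob))   # calibration + sharpness
print("ACE:  ", adaptive_calibration_error(token_prob, correct))
```

AUROC measures whether higher confidence separates correct from incorrect answers; ACE and the Brier score additionally penalize confidence values that do not match observed accuracy, which is how the study quantifies overconfidence.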

Conclusions: Given chatbots' limited capacity to accurately evaluate their own confidence when responding to medical queries, clinicians and patients should refrain from relying on their self-rated certainty. Instead, token probabilities emerge as a promising and easily accessible alternative for gauging these models' internal uncertainty.


Source journal: Journal of Medical Internet Research
CiteScore: 14.40 · Self-citation rate: 5.40% · Articles per year: 654 · Review time: 1 month

About the journal: The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades. The journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by impact factor, and holds the #1 position on Google Scholar within the "Medical Informatics" discipline.