Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions

Raphaël Bentegeac, Bastien Le Guellec, Grégory Kuchcinski, Philippe Amouyel, Aghiles Hamroun

Journal of Medical Internet Research. Published July 1, 2025. DOI: 10.2196/64348
Background: Chatbots have demonstrated promising capabilities in medicine, achieving passing scores on board examinations across various specialties. However, their tendency to express high confidence in their responses, even when incorrect, limits their utility in clinical settings.
Objective: To examine whether token probabilities outperform chatbots' expressed confidence levels in predicting the accuracy of their responses to medical questions.
Methods: Nine large language models (LLMs), both commercial (GPT-3.5, GPT-4, and GPT-4o) and open-source (Llama 3.1-8B, Llama 3.1-70B, Phi-3-Mini, Phi-3-Medium, Gemma 2-9B, and Gemma 2-27B), were prompted to respond to a set of 2,522 questions from the US Medical Licensing Examination (MedQA database). Each model also rated its confidence from 0 to 100, and the token probability of each response was extracted. The models' success rates were measured, and the performance of both expressed confidence and response token probability in predicting response accuracy was evaluated using the area under the receiver operating characteristic curve (AUROC), adapted calibration error (ACE), and Brier score. Sensitivity analyses were conducted using additional questions sourced from other databases in English (MedMCQA, n=2,797), Chinese (MedQA Mainland China, n=3,413, and Taiwan, n=2,808), and French (FrMedMCQA, n=1,079), as well as different prompting strategies and temperature settings.
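To make the token-probability signal concrete, the sketch below shows one way to recover it for a multiple-choice question, using a Hugging Face causal language model. This is a minimal illustration rather than the authors' exact pipeline: the model name, prompt format, and single-letter answer convention are all assumptions.

```python
# Minimal sketch (not the authors' exact pipeline): extract the probability
# a causal LM assigns to each candidate answer letter of a multiple-choice
# question. Model name and prompt wording are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    "Answer the following question with a single letter (A-E).\n"
    "Question: ...\nA) ...\nB) ...\nC) ...\nD) ...\nE) ...\nAnswer: "
)
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token
probs = torch.softmax(logits, dim=-1)        # normalize to probabilities

# Probability mass on each candidate answer letter (assumes each letter
# maps to a single token in this tokenizer).
choices = ["A", "B", "C", "D", "E"]
choice_ids = [tok.encode(c, add_special_tokens=False)[0] for c in choices]
choice_probs = {c: probs[i].item() for c, i in zip(choices, choice_ids)}

answer = max(choice_probs, key=choice_probs.get)
print(answer, choice_probs[answer])          # response and its token probability
```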
Results: Overall, mean accuracy ranged from 56.5% [54.6-58.5] for Phi-3-Mini to 89.0% [87.7-90.2] for GPT-4o. Across the US Medical Licensing Examination questions, all chatbots consistently expressed high levels of confidence in their responses (ranging from 90 [90-90] for Llama 3.1-70B to 100 [100-100] for GPT-3.5). However, expressed confidence failed to predict response accuracy (AUROC ranging from 0.52 [0.50-0.53] for Phi-3-Mini to 0.68 [0.65-0.71] for GPT-4o). In contrast, the response token probability consistently outperformed expressed confidence in predicting response accuracy (AUROC ranging from 0.71 [0.69-0.73] for Phi-3-Mini to 0.87 [0.85-0.89] for GPT-4o; all p<0.001). Furthermore, all models demonstrated imperfect calibration, with a general trend toward overconfidence. These findings were consistent across sensitivity analyses.
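The discrimination and calibration comparison above can be reproduced in outline with standard tools: AUROC and Brier score are available in scikit-learn, and a calibration error can be computed by binning. The sketch below runs on synthetic data, and its equal-mass-bin calibration error is one plausible reading of "adapted calibration error," not necessarily the authors' exact definition.

```python
# Sketch on synthetic data: score how well two uncertainty signals
# (near-constant expressed confidence vs. informative token probability)
# predict answer correctness. Data-generating choices here are assumptions
# for illustration only, not the study's data.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
n = 2522
correct = rng.random(n) < 0.7                     # 1 = correct answer
expressed = np.full(n, 0.95)                      # near-constant self-rated confidence
token_prob = np.clip(0.6 * correct + 0.3 + rng.normal(0, 0.1, n), 0, 1)

for name, score in [("expressed", expressed), ("token prob", token_prob)]:
    auroc = roc_auc_score(correct, score)
    brier = brier_score_loss(correct, score)
    # Equal-mass binning: split scores into 10 same-size bins and average
    # |mean confidence - observed accuracy| across bins.
    order = np.argsort(score)
    ace = np.mean([abs(score[b].mean() - correct[b].mean())
                   for b in np.array_split(order, 10)])
    print(f"{name}: AUROC={auroc:.2f} Brier={brier:.2f} ACE={ace:.2f}")
```

On this toy setup, the constant expressed confidence yields an AUROC of 0.5 (no discrimination) and a large calibration error, while the token probability separates correct from incorrect answers, mirroring the direction of the reported findings.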
Conclusions: Given the limited capacity of chatbots to accurately evaluate their confidence when responding to medical queries, clinicians and patients should abstain from relying on their self-rated certainty. Instead, token probabilities emerge as a promising and easily accessible alternative for gauging these models' underlying uncertainty.
About the journal:
The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades.
As a leader in the field, the journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor.
Notably, JMIR is ranked #1 on Google Scholar within the "Medical Informatics" discipline.