Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions

Raphaël Bentegeac, Bastien Le Guellec, Grégory Kuchcinski, Philippe Amouyel, Aghiles Hamroun

Journal of Medical Internet Research. Published July 1, 2025. DOI: 10.2196/64348
Background: Chatbots have demonstrated promising capabilities in medicine, achieving passing scores on board examinations across various specialties. However, their tendency to express high confidence in their responses, even when incorrect, limits their utility in clinical settings.
Objective: To examine whether token probabilities outperform chatbots' expressed confidence levels in predicting the accuracy of their responses to medical questions.
Methods: Nine large language models (LLMs), both commercial (GPT-3.5, GPT-4, and GPT-4o) and open-source (Llama 3.1-8B, Llama 3.1-70B, Phi-3-Mini, Phi-3-Medium, Gemma 2-9B, and Gemma 2-27B), were prompted to respond to a set of 2,522 questions from the US Medical Licensing Examination (MedQA database). Each model also rated its confidence from 0 to 100, and the token probability of each response was extracted. The models' success rates were measured, and the performance of both expressed confidence and response token probability in predicting response accuracy was evaluated using the area under the receiver operating characteristic curve (AUROC), adapted calibration error (ACE), and Brier score. Sensitivity analyses were conducted using additional questions sourced from other databases in English (MedMCQA, n=2,797), Chinese (MedQA Mainland China, n=3,413, and Taiwan, n=2,808), and French (FrMedMCQA, n=1,079), as well as different prompting strategies and temperature settings.
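To make the token-probability signal concrete, the sketch below shows one way to recover it for a multiple-choice question, using a Hugging Face causal language model. This is a minimal illustration rather than the authors' exact pipeline: the model name, prompt format, and single-letter answer convention are all assumptions.

```python
# Minimal sketch (not the authors' exact pipeline): extract the probability
# a causal LM assigns to each candidate answer letter of a multiple-choice
# question. Model name and prompt wording are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    "Answer the following question with a single letter (A-E).\n"
    "Question: ...\nA) ...\nB) ...\nC) ...\nD) ...\nE) ...\nAnswer: "
)
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token
probs = torch.softmax(logits, dim=-1)        # normalize to probabilities

# Probability mass on each candidate answer letter (assumes each letter
# maps to a single token in this tokenizer).
choices = ["A", "B", "C", "D", "E"]
choice_ids = [tok.encode(c, add_special_tokens=False)[0] for c in choices]
choice_probs = {c: probs[i].item() for c, i in zip(choices, choice_ids)}

answer = max(choice_probs, key=choice_probs.get)
print(answer, choice_probs[answer])          # response and its token probability
```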
Results: Overall, mean accuracy ranged from 56.5% [54.6-58.5] for Phi-3-Mini to 89.0% [87.7-90.2] for GPT-4o. Across the US Medical Licensing Examination questions, all chatbots consistently expressed high levels of confidence in their responses (ranging from 90 [90-90] for Llama 3.1-70B to 100 [100-100] for GPT-3.5). However, expressed confidence failed to predict response accuracy (AUROC ranging from 0.52 [0.50-0.53] for Phi-3-Mini to 0.68 [0.65-0.71] for GPT-4o). In contrast, the response token probability consistently outperformed expressed confidence in predicting response accuracy (AUROC ranging from 0.71 [0.69-0.73] for Phi-3-Mini to 0.87 [0.85-0.89] for GPT-4o; all p<0.001). Furthermore, all models demonstrated imperfect calibration, with a general trend toward overconfidence. These findings were consistent across sensitivity analyses.
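The discrimination and calibration comparison above can be reproduced in outline with standard tools: AUROC and Brier score are available in scikit-learn, and a calibration error can be computed by binning. The sketch below runs on synthetic data, and its equal-mass-bin calibration error is one plausible reading of "adapted calibration error," not necessarily the authors' exact definition.

```python
# Sketch on synthetic data: score how well two uncertainty signals
# (near-constant expressed confidence vs. informative token probability)
# predict answer correctness. Data-generating choices here are assumptions
# for illustration only, not the study's data.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
n = 2522
correct = rng.random(n) < 0.7                     # 1 = correct answer
expressed = np.full(n, 0.95)                      # near-constant self-rated confidence
token_prob = np.clip(0.6 * correct + 0.3 + rng.normal(0, 0.1, n), 0, 1)

for name, score in [("expressed", expressed), ("token prob", token_prob)]:
    auroc = roc_auc_score(correct, score)
    brier = brier_score_loss(correct, score)
    # Equal-mass binning: split scores into 10 same-size bins and average
    # |mean confidence - observed accuracy| across bins.
    order = np.argsort(score)
    ace = np.mean([abs(score[b].mean() - correct[b].mean())
                   for b in np.array_split(order, 10)])
    print(f"{name}: AUROC={auroc:.2f} Brier={brier:.2f} ACE={ace:.2f}")
```

On this toy setup, the constant expressed confidence yields an AUROC of 0.5 (no discrimination) and a large calibration error, while the token probability separates correct from incorrect answers, mirroring the direction of the reported findings.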
Conclusions: Given the limited capacity of chatbots to accurately evaluate their confidence when responding to medical queries, clinicians and patients should abstain from relying on their self-rated certainty. Instead, token probabilities emerge as a promising and easily accessible alternative for gauging these models' underlying uncertainty.
About the journal:
The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades.
As a leader in the field, the journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor.
Notably, JMIR is ranked #1 on Google Scholar within the "Medical Informatics" discipline.