Trent N Cash, Daniel M Oppenheimer, Sara Christie, Mira Devgan
{"title":"量化不确定性:检验法学硕士置信度判断的准确性。","authors":"Trent N Cash, Daniel M Oppenheimer, Sara Christie, Mira Devgan","doi":"10.3758/s13421-025-01755-4","DOIUrl":null,"url":null,"abstract":"<p><p>The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions on nearly any topic. When humans answer questions, especially difficult or uncertain questions, they often accompany their responses with metacognitive confidence judgments indicating their belief in their accuracy. LLMs are certainly capable of providing confidence judgments, but it is currently unclear how accurate these confidence judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and human participants in both domains of aleatory uncertainty-NFL predictions (Study 1; n = 502) and Oscar predictions (Study 2; n = 109)-and domains of epistemic uncertainty-Pictionary performance (Study 3; n = 164), Trivia questions (Study 4; n = 110), and questions about life at a university (Study 5; n = 110). We find several commonalities between LLMs and humans, such as achieving similar levels of absolute and relative metacognitive accuracy (although LLMs tend to be slightly more accurate on both dimensions). Like humans, we also find that LLMs tend to be overconfident. However, we find that, unlike humans, LLMs-especially ChatGPT and Gemini-often fail to adjust their confidence judgments based on past performance, highlighting a key metacognitive limitation.</p>","PeriodicalId":48398,"journal":{"name":"Memory & Cognition","volume":" ","pages":""},"PeriodicalIF":2.2000,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Quantifying uncert-AI-nty: Testing the accuracy of LLMs' confidence judgments.\",\"authors\":\"Trent N Cash, Daniel M Oppenheimer, Sara Christie, Mira Devgan\",\"doi\":\"10.3758/s13421-025-01755-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions on nearly any topic. When humans answer questions, especially difficult or uncertain questions, they often accompany their responses with metacognitive confidence judgments indicating their belief in their accuracy. LLMs are certainly capable of providing confidence judgments, but it is currently unclear how accurate these confidence judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and human participants in both domains of aleatory uncertainty-NFL predictions (Study 1; n = 502) and Oscar predictions (Study 2; n = 109)-and domains of epistemic uncertainty-Pictionary performance (Study 3; n = 164), Trivia questions (Study 4; n = 110), and questions about life at a university (Study 5; n = 110). 
We find several commonalities between LLMs and humans, such as achieving similar levels of absolute and relative metacognitive accuracy (although LLMs tend to be slightly more accurate on both dimensions). Like humans, we also find that LLMs tend to be overconfident. However, we find that, unlike humans, LLMs-especially ChatGPT and Gemini-often fail to adjust their confidence judgments based on past performance, highlighting a key metacognitive limitation.</p>\",\"PeriodicalId\":48398,\"journal\":{\"name\":\"Memory & Cognition\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2025-07-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Memory & Cognition\",\"FirstCategoryId\":\"102\",\"ListUrlMain\":\"https://doi.org/10.3758/s13421-025-01755-4\",\"RegionNum\":3,\"RegionCategory\":\"心理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PSYCHOLOGY, EXPERIMENTAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Memory & Cognition","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.3758/s13421-025-01755-4","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PSYCHOLOGY, EXPERIMENTAL","Score":null,"Total":0}
Quantifying uncert-AI-nty: Testing the accuracy of LLMs' confidence judgments.
The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions on nearly any topic. When humans answer questions, especially difficult or uncertain ones, they often accompany their responses with metacognitive confidence judgments that indicate how strongly they believe their answers are accurate. LLMs are certainly capable of providing confidence judgments, but it is currently unclear how accurate those judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and by human participants in domains of aleatory uncertainty (NFL predictions, Study 1, n = 502; Oscar predictions, Study 2, n = 109) and domains of epistemic uncertainty (Pictionary performance, Study 3, n = 164; trivia questions, Study 4, n = 110; questions about life at a university, Study 5, n = 110). We find several commonalities between LLMs and humans, such as similar levels of absolute and relative metacognitive accuracy (although LLMs tend to be slightly more accurate on both dimensions). We also find that, like humans, LLMs tend to be overconfident. However, unlike humans, LLMs, especially ChatGPT and Gemini, often fail to adjust their confidence judgments based on past performance, highlighting a key metacognitive limitation.
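To make the distinction between absolute and relative accuracy of confidence judgments concrete, here is a minimal sketch, not the paper's analysis code: it assumes a calibration-style measure of absolute accuracy (mean confidence minus proportion correct, where positive values indicate overconfidence) and a simple confidence-accuracy correlation for relative accuracy; the studies' exact measures may differ.

```python
# Illustrative sketch of scoring confidence judgments (hypothetical data,
# not from the paper). Absolute accuracy here is calibration bias;
# relative accuracy is the confidence-accuracy correlation (resolution).
import numpy as np

def calibration_bias(confidence, correct):
    """Mean confidence minus proportion correct; positive = overconfident."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return confidence.mean() - correct.mean()

def resolution(confidence, correct):
    """Correlation between confidence and item-level accuracy."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return np.corrcoef(confidence, correct)[0, 1]

# Hypothetical responses: confidence on a 0-1 scale, 1 = answered correctly.
conf = [0.9, 0.8, 0.95, 0.6, 0.7]
acc  = [1,   0,   1,    0,   1]

print(f"calibration bias: {calibration_bias(conf, acc):+.2f}")  # +0.19 here
print(f"resolution (r):   {resolution(conf, acc):.2f}")
```

A respondent can be well calibrated overall (low bias) yet poor at distinguishing which particular answers are right (low resolution), which is why the paper reports both dimensions.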
Journal introduction:
Memory & Cognition covers human memory and learning, conceptual processes, psycholinguistics, problem solving, thinking, decision making, and skilled performance, including relevant work in the areas of computer simulation, information processing, mathematical psychology, developmental psychology, and experimental social psychology.