Teresa P. Nguyen, Brendan Carvalho, Hannah Sukhdeo, Kareem Joudi, Nan Guo, Marianne Chen, Jed T. Wolpaw, Jesse J. Kiefer, Melissa Byrne, Tatiana Jamroz, Allison A. Mootz, Sharon C. Reale, James Zou, Pervez Sultan
BJA Open, volume 10, Article 100280. Published 2024-05-08. DOI: 10.1016/j.bjao.2024.100280. Full text: https://www.sciencedirect.com/science/article/pii/S2772609624000248
Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia
Background
Patients are increasingly using artificial intelligence (AI) chatbots to seek answers to medical queries.
Methods
Ten frequently asked questions in anaesthesia were posed to three AI chatbots: ChatGPT4 (OpenAI), Bard (Google), and Bing Chat (Microsoft). Each chatbot's answers were evaluated in a randomised, blinded order by five residency programme directors from 15 medical institutions in the USA. Three medical content quality categories (accuracy, comprehensiveness, safety) and three communication quality categories (understandability, empathy/respect, and ethics) were scored between 1 and 5 (1 representing worst, 5 representing best).
Results
ChatGPT4 and Bard outperformed Bing Chat (median [interquartile range] scores: 4 [3–4], 4 [3–4], and 3 [2–4], respectively; P<0.001 with all metrics combined). All AI chatbots performed poorly in accuracy (score ≥4 by 58%, 48%, and 36% of experts for ChatGPT4, Bard, and Bing Chat, respectively), comprehensiveness (score ≥4 by 42%, 30%, and 12% of experts, respectively), and safety (score ≥4 by 50%, 40%, and 28% of experts, respectively). Notably, answers from ChatGPT4, Bard, and Bing Chat differed statistically in comprehensiveness (ChatGPT4, 3 [2–4] vs Bing Chat, 2 [2–3], P<0.001; and Bard, 3 [2–4] vs Bing Chat, 2 [2–3], P=0.002). All large language model chatbots performed well with no statistical difference for understandability (P=0.24), empathy (P=0.032), and ethics (P=0.465).
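The summary statistics reported above (median [interquartile range] and the share of ratings scoring ≥4) can be illustrated with a minimal Python sketch. The ratings below are hypothetical and are not data from the study; the function name `summarise` and the choice of the "inclusive" quartile method are assumptions made for illustration, since the abstract does not state how quartiles were computed.

```python
from statistics import median, quantiles


def summarise(scores):
    """Summarise 1-5 ratings: median, quartiles, and percent of ratings >= 4."""
    # Quartile method is an assumption; the abstract does not specify one.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    pct_ge_4 = 100 * sum(s >= 4 for s in scores) / len(scores)
    return {"median": median(scores), "q1": q1, "q3": q3, "pct_ge_4": pct_ge_4}


# Hypothetical ratings for one chatbot in one category (NOT study data).
example_scores = [2, 3, 3, 4, 4, 4, 5, 2, 3, 4]
summary = summarise(example_scores)

print(
    f"{summary['median']} [{summary['q1']}-{summary['q3']}], "
    f"{summary['pct_ge_4']:.0f}% rated >= 4"
)
# prints: 3.5 [3.0-4.0], 50% rated >= 4
```

Reporting both figures is useful for ordinal scales like this one: the median and interquartile range describe the typical rating, while the share of ratings at or above 4 shows how often an answer met the higher quality threshold.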
Conclusions
In answering patients' frequently asked questions about anaesthesia, the chatbots performed well on communication metrics but were suboptimal on medical content metrics. Overall, ChatGPT4 and Bard were comparable to each other, and both outperformed Bing Chat.