An Exploratory Analysis of ChatGPT Compared to Human Performance With the Anesthesiology Oral Board Examination: Initial Insights and Implications.

Samuel N Blacker, Fei Chen, Daniel Winecoff, Benjamin L Antonio, Harendra Arora, Bryan J Hierlmeier, Rachel M Kacmar, Anthony N Passannante, Anthony R Plunkett, David Zvara, Benjamin Cobb, Alexander Doyal, Daniel Rosenkrans, Kenneth Bradbury Brown, Michael A Gonzalez, Courtney Hood, Tiffany T Pham, Abhijit V Lele, Lesley Hall, Ameer Ali, Robert S Isaak

Anesthesia & Analgesia. Published online September 13, 2024. doi: 10.1213/ane.0000000000006875
BACKGROUND
Chat Generative Pre-Trained Transformer (ChatGPT) has passed various high-level examinations. However, it has not been tested on an examination such as the American Board of Anesthesiology (ABA) Standardized Oral Examination (SOE). The SOE is designed to assess higher-level competencies, such as judgment, organization, adaptability to unexpected clinical changes, and presentation of information.
METHODS
Four anesthesiology fellows were examined on 2 sample ABA SOEs. Their answers were compared with ChatGPT's responses to the same questions. All human and ChatGPT responses were transcribed, randomized by module, and then reproduced as complete examinations using a commercially available software-based human voice replicator. Eight ABA applied examiners listened to and scored the topics and modules from 1 of the 4 versions of each of the 2 sample examinations. The ABA did not provide support to, or collaborate with, any of the authors.
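The abstract does not detail how the module-level randomization was implemented. Purely as an illustration, the following Python sketch shows one way transcribed responses could be randomly assigned, module by module, to blinded examination versions; the module list, source labels, and assignment scheme are hypothetical assumptions, not the authors' protocol.

```python
# Hypothetical sketch of module-level randomization for blinded exam versions.
# None of the names or numbers below come from the study.
import random

modules = [f"module_{i}" for i in range(1, 13)]          # assumed module list
sources = ["fellow_1", "fellow_2", "fellow_3", "fellow_4", "chatgpt"]

def build_blinded_version(seed: int) -> dict[str, str]:
    """Randomly assign a response source to each module for one exam version."""
    rng = random.Random(seed)
    return {m: rng.choice(sources) for m in modules}

# Four blinded versions of one sample examination (assignment scheme assumed).
versions = {f"version_{v}": build_blinded_version(seed=v) for v in range(1, 5)}
for name, assignment in versions.items():
    print(name, assignment)
```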
RESULTS
The anesthesiology fellows' answers received a higher median module topic score than ChatGPT's (P = .03). However, there was no significant difference in the median overall global module scores between the human and ChatGPT responses (P = .17). The examiners identified the ChatGPT-generated answers in 23 of 24 modules (95.83%); only 1 ChatGPT response was perceived as coming from a human. In contrast, the examiners judged the human (fellow) responses to be artificial intelligence (AI)-generated in 10 of 24 modules (41.67%). Examiner comments noted that ChatGPT generated relevant content, but its answers were lengthy and at times did not focus on the priorities of the specific scenario. The examiners made no comments about ChatGPT factual "hallucinations."
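The following Python sketch illustrates the arithmetic behind the reported identification proportions and a nonparametric comparison of median scores. The abstract does not name the statistical test used, so the Mann-Whitney U test and all score values below are illustrative assumptions, not the study's data or method.

```python
# Illustrative only: placeholder scores and an assumed nonparametric test.
from statistics import median
from scipy.stats import mannwhitneyu

# Hypothetical module-topic scores (not the study's data).
fellow_scores = [7, 6, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7]
chatgpt_scores = [6, 5, 6, 7, 5, 6, 6, 5, 6, 6, 5, 6]

print("Fellow median:", median(fellow_scores))
print("ChatGPT median:", median(chatgpt_scores))

stat, p = mannwhitneyu(fellow_scores, chatgpt_scores, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, P = {p:.3f}")

# The detection rates reported in the abstract are simple proportions:
print(f"ChatGPT answers identified as AI: {23 / 24:.2%}")    # 95.83%
print(f"Fellow answers mistaken for AI:   {10 / 24:.2%}")    # 41.67%
```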
CONCLUSIONS
ChatGPT generated SOE answers with module ratings comparable to those of anesthesiology fellows, as graded by 8 ABA oral board examiners. However, the ChatGPT answers were judged subjectively inferior because of their length and lack of focus. Future curation and training of an AI system such as ChatGPT could produce answers more in line with ideal ABA SOE answers. This could lead to higher performance and an anesthesiology-specific trained AI useful for training and examination preparation.