Frank I Jackson, Nathan A Keller, Insaf Kouba, Wassil Kouba, Luis A Bracero, Matthew J Blitz
{"title":"面向研究生医学教育的大型语言模型临床小短文和多项选择题。","authors":"Frank I Jackson, Nathan A Keller, Insaf Kouba, Wassil Kouba, Luis A Bracero, Matthew J Blitz","doi":"10.1097/ACM.0000000000006137","DOIUrl":null,"url":null,"abstract":"<p><strong>Abstract: </strong>ProblemClinical vignette-based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content.ApproachThe authors generated MCQs using a structured prompt engineering approach, incorporating authoritative source documents and an iterative prompt chaining technique, to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes (10 questions each). The AI-generated MCQs were developed in August 2024 and surveys conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was human or AI written or if they were uncertain.OutcomesThirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%-50.0%) of the time, indicating difficulty in distinguishing human- and AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%-75.0%) for human-generated MCQs and 64.4% (50.0%-83.3%) for AI-generated MCQs (P = .74). The difficulty (0.69 vs 0.66, P = .83) and discriminatory (0.42 vs 0.38, P = .90) indexes showed no significant differences, supporting the feasibility of large language model-generated MCQs in medical education.Next StepsFuture studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.</p>","PeriodicalId":50929,"journal":{"name":"Academic Medicine","volume":" ","pages":""},"PeriodicalIF":5.3000,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large Language Model Clinical Vignettes and Multiple-Choice Questions for Postgraduate Medical Education.\",\"authors\":\"Frank I Jackson, Nathan A Keller, Insaf Kouba, Wassil Kouba, Luis A Bracero, Matthew J Blitz\",\"doi\":\"10.1097/ACM.0000000000006137\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Abstract: </strong>ProblemClinical vignette-based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. 
This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content.ApproachThe authors generated MCQs using a structured prompt engineering approach, incorporating authoritative source documents and an iterative prompt chaining technique, to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes (10 questions each). The AI-generated MCQs were developed in August 2024 and surveys conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was human or AI written or if they were uncertain.OutcomesThirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%-50.0%) of the time, indicating difficulty in distinguishing human- and AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%-75.0%) for human-generated MCQs and 64.4% (50.0%-83.3%) for AI-generated MCQs (P = .74). The difficulty (0.69 vs 0.66, P = .83) and discriminatory (0.42 vs 0.38, P = .90) indexes showed no significant differences, supporting the feasibility of large language model-generated MCQs in medical education.Next StepsFuture studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.</p>\",\"PeriodicalId\":50929,\"journal\":{\"name\":\"Academic Medicine\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2025-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Academic Medicine\",\"FirstCategoryId\":\"95\",\"ListUrlMain\":\"https://doi.org/10.1097/ACM.0000000000006137\",\"RegionNum\":2,\"RegionCategory\":\"教育学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION, SCIENTIFIC DISCIPLINES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Academic Medicine","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1097/ACM.0000000000006137","RegionNum":2,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0
Large Language Model Clinical Vignettes and Multiple-Choice Questions for Postgraduate Medical Education.
Abstract

Problem: Clinical vignette-based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content.

Approach: The authors generated MCQs using a structured prompt engineering approach, incorporating authoritative source documents and an iterative prompt chaining technique, to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes (10 questions each). The AI-generated MCQs were developed in August 2024, and surveys were conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or the Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was human or AI written or if they were uncertain.

Outcomes: Thirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%-50.0%) of the time, indicating difficulty in distinguishing human- and AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%-75.0%) for human-generated MCQs and 64.4% (50.0%-83.3%) for AI-generated MCQs (P = .74). The difficulty (0.69 vs 0.66, P = .83) and discrimination (0.42 vs 0.38, P = .90) indexes showed no significant differences, supporting the feasibility of large language model-generated MCQs in medical education.

Next Steps: Future studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.
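The abstract does not reproduce the authors' prompts, model settings, or source documents, so their exact workflow cannot be recovered from this page. As a rough illustration only, the sketch below shows what an iterative prompt-chaining approach to vignette and MCQ generation can look like using the OpenAI Python client; the model name, system prompt, and three-step chain are assumptions for demonstration, not the authors' published method.

```python
# Illustrative sketch only: the published abstract does not include the authors'
# prompts, model settings, or source documents. The model name, system prompt,
# and three-step chain below are assumptions made for demonstration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(messages, prompt):
    """Append a user prompt to the running conversation and return the model's reply."""
    messages.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    text = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": text})
    return text


source_excerpt = "..."  # placeholder for an excerpt from an authoritative guideline

messages = [{"role": "system",
             "content": "You write board-style obstetrics-gynecology questions."}]

# Step 1: draft a clinical vignette grounded only in the supplied source material.
vignette = ask(messages, "Using only this excerpt, write a short clinical vignette:\n"
                         + source_excerpt)

# Step 2: turn the vignette into a single-best-answer MCQ with five options.
mcq = ask(messages, "Write one multiple-choice question about the vignette with "
                    "options A-E and indicate the correct answer.")

# Step 3: ask the model to check and revise its own output before human review.
final_mcq = ask(messages, "Check the question for ambiguity and factual errors against "
                          "the excerpt, then output a corrected final version.")
print(final_mcq)
```

Each step feeds the accumulated conversation back to the model, which is the essence of prompt chaining; in practice, every generated item would still go to expert reviewers before use.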
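The difficulty and discrimination indexes reported above are standard classical test theory item statistics. The abstract does not state which discrimination formula the authors used; the minimal sketch below computes common versions (proportion correct, and the difference in pass rates between upper- and lower-scoring groups), with the group fraction and example data invented purely for illustration.

```python
# Minimal sketch of classical item analysis. The abstract does not state which
# discrimination formula the authors used; this uses a common upper/lower-group
# definition, and the example data below are invented purely for illustration.
from typing import List


def difficulty_index(item_correct: List[bool]) -> float:
    """Proportion of respondents answering the item correctly (higher = easier)."""
    return sum(item_correct) / len(item_correct)


def discrimination_index(item_correct: List[bool], total_scores: List[float],
                         group_fraction: float = 0.27) -> float:
    """Difference in item pass rates between the top- and bottom-scoring groups."""
    n = len(item_correct)
    k = max(1, round(n * group_fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower


# Invented example: 10 respondents' results on one item, plus their total quiz scores.
answers = [True, True, False, True, True, False, True, True, False, True]
scores = [9, 8, 3, 7, 10, 4, 6, 8, 2, 9]

print(difficulty_index(answers))              # 0.7, in the range of the reported 0.66-0.69
print(discrimination_index(answers, scores))  # 1.0 for this toy data
```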
Journal introduction:
Academic Medicine, the official peer-reviewed journal of the Association of American Medical Colleges, acts as an international forum for exchanging ideas, information, and strategies to address the significant challenges in academic medicine. The journal covers areas such as research, education, clinical care, community collaboration, and leadership, with a commitment to serving the public interest.