Anatomy exam model for the circulatory and respiratory systems using GPT-4: a medical school study
Ayla Tekin, Nizameddin Fatih Karamus, Tuncay Çolak
Surgical and Radiologic Anatomy, 47(1):158. Published 2025-06-10. DOI: 10.1007/s00276-025-03667-z
Citations: 0
Abstract
Purpose: The study aimed to evaluate the effectiveness of anatomy multiple-choice questions (MCQs) generated by GPT-4, focusing on their methodological appropriateness and their alignment with the cognitive levels defined by Bloom's revised taxonomy, with the goal of enhancing assessment.
Methods: The assessment questions developed for medical students were created utilizing GPT-4, comprising 240 MCQs organized into subcategories consistent with Bloom's revised taxonomy. When designing prompts to create MCQs, details about the lesson's purpose, learning objectives, and students' prior experiences were included to ensure the questions were contextually appropriate. A set of 30 MCQs was randomly selected from the generated questions for testing. A total of 280 students participated in the examination, which assessed the difficulty index of the MCQs, the item discrimination index, and the overall test difficulty level. Expert anatomists examined the taxonomy accuracy of GPT-4's questions.
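To make the prompt-design step concrete, the sketch below shows one way a Bloom-aligned MCQ request could be sent to GPT-4 through the OpenAI Python client. The prompt wording, the course details, and the `generate_mcq` helper are illustrative assumptions, not the authors' actual protocol.

```python
# Hedged sketch: one way a Bloom-aligned anatomy MCQ prompt could be sent to GPT-4.
# The prompt text, course fields, and helper name are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_mcq(topic: str, objective: str, bloom_level: str, prior_knowledge: str) -> str:
    """Ask GPT-4 for one anatomy MCQ targeted at a given Bloom level (illustrative)."""
    prompt = (
        f"Write one multiple-choice anatomy question on {topic} for medical students.\n"
        f"Learning objective: {objective}\n"
        f"Assumed prior knowledge: {prior_knowledge}\n"
        f"Target cognitive level (Bloom's revised taxonomy): {bloom_level}\n"
        "Provide five options (A-E), mark the single correct answer, "
        "and add a one-line rationale."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(generate_mcq(
    topic="coronary circulation",
    objective="Explain the arterial supply of the myocardium",
    bloom_level="Understand",
    prior_knowledge="Gross anatomy of the mediastinum and heart chambers",
))
```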
Results: Students achieved a median score of 50 (range, 36.67-60) points on the test. The test's internal consistency, assessed by KR-20, was 0.737. The average difficulty of the test was 0.5012. Difficulty and discrimination indices were reported for each AI-generated question. Expert anatomists' taxonomy-based classifications matched GPT-4's in 26.6% of the questions. Meanwhile, 80.9% of students found the questions clear, and 85.8% expressed interest in retaking the assessment exam.
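The item statistics named above follow standard classical test theory. The sketch below computes item difficulty, an upper/lower-group discrimination index, and KR-20 from a binary response matrix; the 27% grouping convention and the simulated data are assumptions, since the paper's exact computation details are not given in the abstract.

```python
# Hedged sketch of the classical-test-theory statistics named in the results.
# The 27% upper/lower grouping is a common convention assumed here; the simulated
# response matrix is only for demonstration.
import numpy as np


def item_difficulty(responses: np.ndarray) -> np.ndarray:
    """Difficulty index p: proportion of students answering each item correctly."""
    return responses.mean(axis=0)


def item_discrimination(responses: np.ndarray, group_frac: float = 0.27) -> np.ndarray:
    """Discrimination index D: p(upper group) - p(lower group), grouped by total score."""
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    n_group = max(1, int(round(group_frac * responses.shape[0])))
    lower, upper = responses[order[:n_group]], responses[order[-n_group:]]
    return upper.mean(axis=0) - lower.mean(axis=0)


def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson 20 reliability for dichotomously scored items."""
    k = responses.shape[1]
    p = responses.mean(axis=0)
    q = 1.0 - p
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)


# Example with simulated 0/1 answers for 280 students x 30 items.
rng = np.random.default_rng(0)
sim = (rng.random((280, 30)) < 0.5).astype(int)
print(item_difficulty(sim).mean(), item_discrimination(sim).mean(), kr20(sim))
```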
Conclusion: This study demonstrates GPT-4's significant potential for generating medical education exam questions. While it effectively assesses basic knowledge recall, it fails to sufficiently evaluate higher-order cognitive processes outlined in Bloom's revised taxonomy. Future research should consider alternative methods that combine AI with expert evaluation and specialized multimodal models.
Journal description:
Anatomy is a morphological science which cannot fail to interest the clinician. The practical application of anatomical research to clinical problems necessitates special adaptation and selectivity in choosing from numerous international works. Although there is a tendency to believe that meaningful advances in anatomy are unlikely, constant revision is necessary. Surgical and Radiologic Anatomy, the first international journal of clinical anatomy, was created in this spirit.
Its goal is to serve clinicians, regardless of speciality (physicians, surgeons, radiologists or other specialists), as an indispensable aid with which they can improve their knowledge of anatomy. Each issue includes original papers, review articles, articles on the anatomical bases of medical, surgical and radiological techniques, articles on normal radiologic anatomy, and brief reviews of anatomical publications of clinical interest.
Particular attention is given to high quality illustrations, which are indispensable for a better understanding of anatomical problems.
Surgical and Radiologic Anatomy is a journal written by anatomists for clinicians with a special interest in anatomy.