Zuhal Yapıcı Coşkun, Yavuz Selim Kıyak, Özlem Coşkun, Işıl İrem Budakoğlu, Özhan Özdemir
{"title":"用于生成产科和妇科脚本一致性测试的大型语言模型:ChatGPT和Claude。","authors":"Zuhal Yapıcı Coşkun, Yavuz Selim Kıyak, Özlem Coşkun, Işıl İrem Budakoğlu, Özhan Özdemir","doi":"10.1080/0142159X.2025.2497888","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>To evaluate the performance of large language models (ChatGPT-4o and Claude 3.5 Sonnet) to generate script concordance test (SCT) items for assessing clinical reasoning in obstetrics and gynecology.</p><p><strong>Methods: </strong>This cross-sectional study involved the generation of SCT items for five common diagnostic topics in obstetrics and gynecology in primary care settings. A total of 16 panelists evaluated the AI-generated SCT items against 11 predefined criteria. Descriptive statistics were used to compare the models' performance across criteria.</p><p><strong>Results: </strong>ChatGPT-4o had an overall agreement rate of 90.57% for SCT items meeting the quality criteria, while Claude 3.5 Sonnet achieved 91.48%. The criterion with the lowest scores was \"The scenario is of appropriate difficulty for medical students,\" with ChatGPT-4o rated at 71.25% and Claude 3.5 Sonnet at 76.25%.</p><p><strong>Conclusion: </strong>Large language models can generate SCT items that effectively assess clinical reasoning; however, further refinement is required to ensure the appropriate level of difficulty for medical students. These findings highlight the potential of AI to enhance the efficiency of SCT generation in obstetrics and gynecology within primary care settings.</p>","PeriodicalId":18643,"journal":{"name":"Medical Teacher","volume":" ","pages":"1-5"},"PeriodicalIF":3.3000,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large language models for generating script concordance test in obstetrics and gynecology: ChatGPT and Claude.\",\"authors\":\"Zuhal Yapıcı Coşkun, Yavuz Selim Kıyak, Özlem Coşkun, Işıl İrem Budakoğlu, Özhan Özdemir\",\"doi\":\"10.1080/0142159X.2025.2497888\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>To evaluate the performance of large language models (ChatGPT-4o and Claude 3.5 Sonnet) to generate script concordance test (SCT) items for assessing clinical reasoning in obstetrics and gynecology.</p><p><strong>Methods: </strong>This cross-sectional study involved the generation of SCT items for five common diagnostic topics in obstetrics and gynecology in primary care settings. A total of 16 panelists evaluated the AI-generated SCT items against 11 predefined criteria. Descriptive statistics were used to compare the models' performance across criteria.</p><p><strong>Results: </strong>ChatGPT-4o had an overall agreement rate of 90.57% for SCT items meeting the quality criteria, while Claude 3.5 Sonnet achieved 91.48%. The criterion with the lowest scores was \\\"The scenario is of appropriate difficulty for medical students,\\\" with ChatGPT-4o rated at 71.25% and Claude 3.5 Sonnet at 76.25%.</p><p><strong>Conclusion: </strong>Large language models can generate SCT items that effectively assess clinical reasoning; however, further refinement is required to ensure the appropriate level of difficulty for medical students. 
These findings highlight the potential of AI to enhance the efficiency of SCT generation in obstetrics and gynecology within primary care settings.</p>\",\"PeriodicalId\":18643,\"journal\":{\"name\":\"Medical Teacher\",\"volume\":\" \",\"pages\":\"1-5\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-04-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Medical Teacher\",\"FirstCategoryId\":\"95\",\"ListUrlMain\":\"https://doi.org/10.1080/0142159X.2025.2497888\",\"RegionNum\":2,\"RegionCategory\":\"教育学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION, SCIENTIFIC DISCIPLINES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Teacher","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1080/0142159X.2025.2497888","RegionNum":2,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Large language models for generating script concordance test in obstetrics and gynecology: ChatGPT and Claude.
Objective: To evaluate the performance of large language models (ChatGPT-4o and Claude 3.5 Sonnet) in generating script concordance test (SCT) items for assessing clinical reasoning in obstetrics and gynecology.
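The abstract does not reproduce the study's prompts or generation settings. As an illustration only, the sketch below shows how SCT items might be requested from the two models through their public Python SDKs; the model identifiers, prompt wording, and variable names are assumptions, not the authors' protocol.

```python
# Illustrative sketch only: the study's actual prompts and model settings are
# not reported in the abstract. Model IDs and prompt text below are assumptions.
from openai import OpenAI  # pip install openai
import anthropic           # pip install anthropic

# An SCT item pairs a short clinical vignette with a diagnostic hypothesis and
# a new piece of information; examinees rate on a -2..+2 Likert scale how the
# new information affects the hypothesis.
PROMPT = (
    "Write one script concordance test item for a common primary-care "
    "obstetrics and gynecology diagnostic topic. Include: a brief clinical "
    "vignette, a diagnostic hypothesis, a new finding, and a -2 to +2 "
    "Likert response scale."
)

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
gpt_item = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": PROMPT}],
).choices[0].message.content

claude_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
claude_item = claude_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": PROMPT}],
).content[0].text
```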
Methods: This cross-sectional study involved the generation of SCT items for five common diagnostic topics in obstetrics and gynecology in primary care settings. A total of 16 panelists evaluated the AI-generated SCT items against 11 predefined criteria. Descriptive statistics were used to compare the models' performance across criteria.
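For readers who want to reproduce this kind of summary, a minimal sketch of the descriptive statistics follows: it computes per-criterion and overall agreement rates from a panelists-by-criteria matrix of binary judgements. The matrix shape mirrors the study design (16 panelists, 11 criteria), but the values and names are invented stand-ins, not the study's data.

```python
import numpy as np

# Hypothetical data: for one model, a (panelists x criteria) matrix of binary
# judgements (1 = item meets the criterion, 0 = it does not). The 16 x 11
# shape mirrors the study design; the values are random stand-ins.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(16, 11))

# Agreement rate per criterion: share of panelists judging the criterion met.
per_criterion = ratings.mean(axis=0)  # length-11 vector of proportions

# Overall agreement rate: mean over all panelist-criterion judgements,
# comparable in kind to the 90.57% / 91.48% figures in the Results.
overall = ratings.mean()

print({f"criterion_{i + 1}": f"{p:.2%}" for i, p in enumerate(per_criterion)})
print(f"overall agreement: {overall:.2%}")
```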
Results: ChatGPT-4o had an overall agreement rate of 90.57% for SCT items meeting the quality criteria, while Claude 3.5 Sonnet achieved 91.48%. The criterion with the lowest scores was "The scenario is of appropriate difficulty for medical students," with ChatGPT-4o rated at 71.25% and Claude 3.5 Sonnet at 76.25%.
Conclusion: Large language models can generate SCT items that effectively assess clinical reasoning; however, further refinement is required to ensure the appropriate level of difficulty for medical students. These findings highlight the potential of AI to enhance the efficiency of SCT generation in obstetrics and gynecology within primary care settings.
About the journal:
Medical Teacher provides accounts of new teaching methods, offers guidance on structuring courses and assessing achievement, and serves as a forum for communication between medical teachers and those involved in general education. In particular, the journal recognizes the problems teachers face in keeping up to date with developments in educational methods that lead to more effective teaching and learning, at a time when the content of the curriculum, from medical procedures to policy changes in health care provision, is also changing. The journal features reports of innovation and research in medical education, case studies, survey articles, practical guidelines, reviews of current literature, and book reviews. All articles are peer reviewed.