AI's ability to interpret unlabeled anatomy images and supplement educational research as an AI rater

Lord J Hyeamang, Tejas C Sekhar, Emily Rush, Amy C Beresheim, Colleen M Cheverko, William S Brooks, Abbey C M Breckling, M Nazmul Karim, Christopher Ferrigno, Adam B Wilson

Anatomical Sciences Education, published 2025-07-11. DOI: 10.1002/ase.70074 (https://doi.org/10.1002/ase.70074)
Evidence suggests that custom chatbots outperform commercial generative artificial intelligence (GenAI) systems for text-based anatomy content inquiries. This study evaluated the abilities of ChatGPT-4o and Claude 3.5 Sonnet to interpret unlabeled anatomical images. Secondarily, ChatGPT o1-preview was evaluated as an AI rater that graded AI-generated outputs against a rubric, and its grades were compared with those of human raters. Anatomical images (five musculoskeletal, five thoracic) spanning diverse image-based media (e.g., illustrations, photographs, MRI) were annotated with identification markers (e.g., arrows, circles) and uploaded to each GenAI system for interpretation. Forty-five prompts (15 first-order, 15 second-order, and 15 third-order questions) with their associated images were submitted to both GenAI systems at two timepoints. Anatomy experts graded the responses for factual accuracy and superfluity (the presence of excessive wording) on a three-point Likert scale. ChatGPT o1-preview was tested for agreement with the human anatomy experts to determine its usefulness as an AI rater. Statistical analyses included inter-rater agreement, hierarchical linear modeling, and test-retest reliability. ChatGPT-4o's factual accuracy score across 45 outputs was 68.0%, compared to Claude 3.5 Sonnet's 61.5% (p = 0.319). As an AI rater, ChatGPT o1-preview showed moderate to substantial agreement with human raters (Cohen's kappa = 0.545-0.755) when evaluating factual accuracy against a rubric of textbook answers. Further improvements and evaluations are needed before commercial GenAI systems can serve as credible student resources in anatomy education. Likewise, ChatGPT o1-preview shows promise as an AI assistant for educational research, though further investigation is warranted.
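To make the agreement analysis concrete, below is a minimal Python sketch of unweighted Cohen's kappa, the inter-rater statistic reported above, computed from first principles. The grades shown are hypothetical stand-ins for the study's three-point Likert ratings, not actual study data.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters grading the same items."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired ratings"
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 3-point Likert grades (1 = inaccurate ... 3 = fully accurate).
human_grades = [3, 2, 3, 1, 2, 3, 3, 2, 1, 3]
ai_grades    = [3, 2, 3, 2, 2, 3, 3, 1, 1, 3]
print(f"kappa = {cohens_kappa(human_grades, ai_grades):.3f}")  # kappa = 0.677

Because the rating scale is ordinal, a weighted kappa (for example, sklearn.metrics.cohen_kappa_score with weights="linear"), which gives partial credit for near-misses, is a common alternative; the abstract does not state which variant the study used.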
Journal introduction:
Anatomical Sciences Education, affiliated with the American Association for Anatomy, serves as an international platform for sharing ideas, innovations, and research related to education in anatomical sciences. Covering gross anatomy, embryology, histology, and neurosciences, the journal addresses education at various levels, including undergraduate, graduate, post-graduate, allied health, medical (both allopathic and osteopathic), and dental. It fosters collaboration and discussion in the field of anatomical sciences education.