Application of Large Language Models in Data Analysis and Medical Education for Assisted Reproductive Technology: Comparative Study.

Impact Factor: 2.0 · JCR Quartile: Q3 · Category: Health Care Sciences & Services
Noriyuki Okuyama, Mika Ishii, Yuriko Fukuoka, Hiromitsu Hattori, Yuta Kasahara, Tai Toshihiro, Koki Yoshinaga, Tomoko Hashimoto, Koichi Kyono
DOI: 10.2196/70107
Journal: JMIR Formative Research, volume 9, article e70107
Publication date: 2025-10-01 (Journal Article)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12488165/pdf/
Citations: 0

Abstract


Application of Large Language Models in Data Analysis and Medical Education for Assisted Reproductive Technology: Comparative Study.

Background: Recent studies have demonstrated that large language models exhibit exceptional performance in medical examinations. However, there is a lack of reports assessing their capabilities in specific domains or their application in practical data analysis using code interpreters. Furthermore, comparative analyses across different large language models have not been extensively conducted.

Objective: The purpose of this study was to evaluate whether advanced artificial intelligence (AI) models can analyze data from template-based input and demonstrate basic knowledge of reproductive medicine. Four AI models (GPT-4, GPT-4o, Claude 3.5 Sonnet, and Gemini Pro 1.5) were evaluated for their data analytical capabilities through numerical calculations and graph rendering. Their knowledge of infertility treatment was assessed using 10 examination questions developed by experts.

Methods: First, we uploaded data to the AI models and furnished instruction templates using the chat interface. The study investigated whether the AI models could perform pregnancy rate analysis and graph rendering, based on blastocyst grades according to Gardner criteria. Second, we assessed model diagnostic capabilities based on specialized knowledge. This evaluation used 10 questions derived from the Japanese Fertility Specialist Examination and the Embryologist Certification Exam, along with chromosome imaging. These materials were curated under the supervision of certified embryologists and fertility specialists. All procedures were repeated 10 times per AI model.
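The pregnancy-rate analysis described above — grouping transfer outcomes by Gardner blastocyst grade and computing a rate per grade — can be illustrated with a minimal sketch. The column names and sample records below are hypothetical and are not taken from the study's dataset; they only show the shape of the template-based task the AI models' code interpreters were asked to perform.

```python
# Hypothetical sketch of the template-based task: pregnancy rate per
# Gardner blastocyst grade. Data and column names are illustrative only.
import pandas as pd

# One row per embryo transfer: Gardner grade and pregnancy outcome (0/1).
records = pd.DataFrame({
    "gardner_grade": ["4AA", "4AB", "3BB", "4AA", "3BB", "4AB", "4AA", "3BB"],
    "pregnant":      [1,     1,     0,     1,     0,     1,     0,     1],
})

# Pregnancy rate = pregnancies / transfers, grouped by grade.
rates = (
    records.groupby("gardner_grade")["pregnant"]
    .agg(transfers="count", pregnancies="sum")
)
rates["pregnancy_rate"] = rates["pregnancies"] / rates["transfers"]
print(rates)
```

A graph-rendering step of the kind the study evaluated would then typically plot `rates["pregnancy_rate"]` as a bar chart, one bar per grade.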

Results: GPT-4o achieved grade A output (defined as achieving the objective with a single output attempt) in 9 out of 10 trials, outperforming GPT-4, which achieved grade A in 7 out of 10. The average processing times for data analysis were 26.8 (SD 3.7) seconds for GPT-4o and 36.7 (SD 3) seconds for GPT-4, whereas Claude failed in all 10 attempts. Gemini achieved an average processing time of 23 (SD 3) seconds and received grade A in 6 out of 10 trials, though occasional manual corrections were needed. Embryologists required an average of 358.3 (SD 9.7) seconds for the same tasks. In the knowledge-based assessment, GPT-4o, Claude, and Gemini achieved perfect scores (9/9) on multiple-choice questions, while GPT-4 showed a 60% (6/10) success rate on 1 question. None of the AI models could reliably diagnose chromosomal abnormalities from karyotype images, with the highest image diagnostic accuracy being 70% (7/10) for Claude and Gemini.

Conclusions: This rapid processing demonstrates the potential for these AI models to significantly expedite data-intensive tasks in clinical settings. This performance underscores their potential utility as educational tools or decision support systems in reproductive medicine. However, none of the models were able to accurately interpret and diagnose using medical images.

Source journal: JMIR Formative Research — Medicine (miscellaneous)
CiteScore: 2.70
Self-citation rate: 9.10%
Articles published: 579
Review time: 12 weeks