Comparative Evaluation of Teaching Plans on Prostate Cancer Generated by Various Large Language Models and a Human Expert

Impact Factor: 2.0 | JCR Q3 | Category: Computer Science, Interdisciplinary Applications

Engineering Reports: Open Access | Pub Date: 2025-08-04 | DOI: 10.1002/eng2.70303

Rong Wang, Yue Ding, Yajun Shen, Haiyong Liu, Ping Wang, Zhixiang Gao


Citations: 0

Abstract

Prostate cancer remains one of the most common malignancies affecting men globally and carries high morbidity and mortality. The complexity and variability of the disease necessitate diverse treatment strategies, ranging from active surveillance to more aggressive interventions such as radical prostatectomy, radiation therapy, and androgen deprivation therapy. This study investigates the potential of large language models (LLMs) to generate educational content on prostate cancer, focusing on the creation of teaching plans in both Chinese and English. Four LLMs, namely GPT-4 (OpenAI), Gemini 1.5 Pro (Google), Kimi AI (Moonshot AI), and Doubao (ByteDance), were evaluated against teaching plans developed by an experienced urology professor. In a double-blind assessment, 25 urology faculty members rated the quality of curriculum content, learning objectives, and outcomes on a standardized 10-point scale. GPT-4 and Gemini 1.5 Pro outperformed Kimi AI and Doubao, yet still lagged behind the human-generated plans, particularly in Chinese. Statistical analyses indicated significant differences in quality scores among the LLMs and the human expert, underscoring the necessity of integrating domain-specific knowledge into AI-generated content. This research highlights the promise and limitations of LLMs in medical education and suggests that future developments should focus on hybrid models that combine artificial intelligence with human expertise to enhance educational efficacy.




Source Journal