Assessing the quality of Japanese online breast cancer treatment information using large language models: a comparison of ChatGPT, Claude, and expert evaluations.
{"title":"Assessing the quality of Japanese online breast cancer treatment information using large language models: a comparison of ChatGPT, Claude, and expert evaluations.","authors":"Atsushi Fushimi, Mitsuo Terada, Rie Tahara, Yuko Nakazawa, Madoka Iwase, Tomoko Shibayama, Samy Kotti, Nami Yamashita, Asumi Iesato","doi":"10.1007/s12282-025-01719-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The internet is a primary source of health information for breast cancer patients, but online content quality varies widely. This study aimed to evaluate the capability of large language models (LLMs), including ChatGPT and Claude, to assess the quality of online Japanese breast cancer treatment information by calculating and comparing their DISCERN scores with those of expert raters.</p><p><strong>Methods: </strong>We analyzed 60 Japanese web pages on breast cancer treatments (surgery, chemotherapy, immunotherapy) using the DISCERN instrument. Each page was evaluated by the LLMs ChatGPT and Claude, along with two expert raters. We assessed LLMs evaluation consistency, correlations between LLMs and expert assessments, and relationships between DISCERN scores, Google search rankings, and content length.</p><p><strong>Results: </strong>Evaluations by LLMs showed high consistency and moderate to strong correlations with expert assessments (ChatGPT vs Expert: r = 0.65; Claude vs Expert: r = 0.68). LLMs assigned slightly higher scores than expert raters. Chemotherapy pages received the highest quality scores, followed by surgery and immunotherapy. We found a weak negative correlation between Google search ranking and DISCERN scores, and a moderate positive correlation (r = 0.45) between content length and quality ratings.</p><p><strong>Conclusions: </strong>This study demonstrates the potential of LLM-assisted evaluation in assessing online health information quality, while highlighting the importance of human expertise. 
LLMs could efficiently process large volumes of health information but should complement human insight for comprehensive assessments. These findings have implications for improving the accessibility and reliability of breast cancer treatment information.</p>","PeriodicalId":520574,"journal":{"name":"Breast cancer (Tokyo, Japan)","volume":" ","pages":"960-969"},"PeriodicalIF":2.9000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Breast cancer (Tokyo, Japan)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s12282-025-01719-1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/5/21 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Background: The internet is a primary source of health information for breast cancer patients, but online content quality varies widely. This study aimed to evaluate the capability of large language models (LLMs), including ChatGPT and Claude, to assess the quality of online Japanese breast cancer treatment information by calculating and comparing their DISCERN scores with those of expert raters.
Methods: We analyzed 60 Japanese web pages on breast cancer treatments (surgery, chemotherapy, immunotherapy) using the DISCERN instrument. Each page was evaluated by the LLMs ChatGPT and Claude, along with two expert raters. We assessed the LLMs' evaluation consistency, correlations between LLM and expert assessments, and relationships among DISCERN scores, Google search rankings, and content length.
Results: Evaluations by LLMs showed high consistency and moderate to strong correlations with expert assessments (ChatGPT vs Expert: r = 0.65; Claude vs Expert: r = 0.68). LLMs assigned slightly higher scores than expert raters. Chemotherapy pages received the highest quality scores, followed by surgery and immunotherapy. We found a weak negative correlation between Google search ranking and DISCERN scores, and a moderate positive correlation (r = 0.45) between content length and quality ratings.
Conclusions: This study demonstrates the potential of LLM-assisted evaluation in assessing online health information quality, while highlighting the importance of human expertise. LLMs could efficiently process large volumes of health information but should complement human insight for comprehensive assessments. These findings have implications for improving the accessibility and reliability of breast cancer treatment information.
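The agreement statistics reported above (e.g., ChatGPT vs Expert: r = 0.65) are Pearson correlation coefficients between two sets of DISCERN totals. A minimal sketch of that calculation, using NumPy and entirely hypothetical scores (the study's per-page data are not given here):

```python
import numpy as np

# Hypothetical DISCERN totals (the instrument's 16 items yield totals
# between 16 and 80) for eight pages, rated by an LLM and by an expert.
# Illustrative values only; not taken from the study.
llm_scores    = np.array([52, 47, 61, 38, 55, 44, 58, 40])
expert_scores = np.array([48, 45, 57, 35, 50, 43, 52, 41])

# Pearson correlation between the two raters' scores
r = np.corrcoef(llm_scores, expert_scores)[0, 1]
print(f"r = {r:.2f}")
```

The same approach would apply to the reported content-length correlation (r = 0.45), substituting page word counts for one of the score vectors.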