Chatbots' performance in premature ejaculation questions: a comparative analysis of reliability, readability, and understandability.

IF 2.5 · CAS Medicine Tier 3 · JCR Q2 · Urology & Nephrology
S Gonultas, S Kardas, M Gelmis, A H Kinik, M Ozalevli, M G Kose, S Sulejman, S Yentur, B Arslan
{"title":"Chatbots' performance in premature ejaculation questions: a comparative analysis of reliability, readability, and understandability.","authors":"S Gonultas, S Kardas, M Gelmis, A H Kinik, M Ozalevli, M G Kose, S Sulejman, S Yentur, B Arslan","doi":"10.1038/s41443-025-01179-3","DOIUrl":null,"url":null,"abstract":"<p><p>This study aimed to evaluate the reliability, readability, and understandability of chatbot responses to frequently asked questions about premature ejaculation, and to assess the contributions, potential risks, and limitations of artificial intelligence. Fifteen questions were selected using data from Google Trends and posed to the chatbots Copilot, Gemini, ChatGPT4o, ChatGPT4oPlus, and DeepSeek-R1. Reliability was evaluated using the Global Quality Scale(GQS) by two experts, readability was assessed with the Flesch Kincaid Reading Ease(FKRE), Flesch Kincaid Grade Level(FKGL), Gunning Fog Index(GFI), and Simple Measure of Gobbledygook(SMOG), and understandability was evaluated using the Patient Educational Materials Assessment Tool for Printable Materials(PEMAT-P). Additionally, the consistency of source citations was examined. The GQS were as follows: Copilot: 3.96 ± 0.66, Gemini: 3.66 ± 0.78, ChatGPT4o: 4.83 ± 0.23, ChatGPT4oPlus: 4.83 ± 0.29, DeepSeek-R1:4.86 ± 0.22 (p < 0.001). The PEMAT-P were as follows: Copilot: 0.70 ± 0.05, Gemini: 0.72 ± 0.04, ChatGPT4o: 0.83 ± 0.03, ChatGPT4oPlus: 0.77 ± 0.06, DeepSeek-R1:0.79 ± 0.06 (p < 0.001). While ChatGPT4oPlus and DeepSeek-R1 scored higher for reliability and understandability, all chatbots performed at an acceptable level (≥70%). However, readability scores were above the recommended level for the target audience. Instances of low reliability or unverified sources were noted, with no significant differences between the chatbots. Chatbots provide highly reliable and informative responses regarding premature ejaculation; however, it is evident that there are significant limitations that require improvement, particularly concerning readability and the reliability of sources.</p>","PeriodicalId":14068,"journal":{"name":"International Journal of Impotence Research","volume":" ","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Impotence Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1038/s41443-025-01179-3","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Citations: 0

Abstract

This study aimed to evaluate the reliability, readability, and understandability of chatbot responses to frequently asked questions about premature ejaculation, and to assess the contributions, potential risks, and limitations of artificial intelligence. Fifteen questions were selected using data from Google Trends and posed to the chatbots Copilot, Gemini, ChatGPT4o, ChatGPT4oPlus, and DeepSeek-R1. Reliability was evaluated by two experts using the Global Quality Scale (GQS); readability was assessed with the Flesch Kincaid Reading Ease (FKRE), Flesch Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), and Simple Measure of Gobbledygook (SMOG); and understandability was evaluated using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). Additionally, the consistency of source citations was examined. The GQS scores were as follows: Copilot: 3.96 ± 0.66, Gemini: 3.66 ± 0.78, ChatGPT4o: 4.83 ± 0.23, ChatGPT4oPlus: 4.83 ± 0.29, DeepSeek-R1: 4.86 ± 0.22 (p < 0.001). The PEMAT-P scores were as follows: Copilot: 0.70 ± 0.05, Gemini: 0.72 ± 0.04, ChatGPT4o: 0.83 ± 0.03, ChatGPT4oPlus: 0.77 ± 0.06, DeepSeek-R1: 0.79 ± 0.06 (p < 0.001). While ChatGPT4oPlus and DeepSeek-R1 scored higher for reliability and understandability, all chatbots performed at an acceptable level (≥70%). However, readability scores were above the recommended level for the target audience. Instances of low reliability or unverified sources were noted, with no significant differences between the chatbots. Chatbots provide highly reliable and informative responses regarding premature ejaculation; however, significant limitations remain, particularly concerning readability and the reliability of sources.
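For readers unfamiliar with the four readability indices named above, the sketch below shows their standard published formulas. The formulas themselves are the well-known ones; the function name and the choice to take pre-computed word, sentence, syllable, and polysyllable counts as inputs (rather than parsing text, since syllable counting is heuristic) are illustrative assumptions, not the study's actual implementation.

```python
from math import sqrt

def readability_scores(words: int, sentences: int, syllables: int,
                       polysyllables: int) -> dict:
    """Standard readability formulas applied to pre-computed counts
    (illustrative sketch, not the study's implementation).

    polysyllables: words with three or more syllables; used here as the
    'complex word' count for both GFI and SMOG. GFI's formal definition
    excludes proper nouns and some compound/suffixed words, which this
    simplification ignores.
    """
    wps = words / sentences   # average words per sentence
    spw = syllables / words   # average syllables per word

    # Flesch Kincaid Reading Ease: higher = easier (60-70 is roughly plain English)
    fkre = 206.835 - 1.015 * wps - 84.6 * spw

    # Flesch Kincaid Grade Level: U.S. school grade needed to read the text
    fkgl = 0.39 * wps + 11.8 * spw - 15.59

    # Gunning Fog Index: years of schooling needed to follow the text
    gfi = 0.4 * (wps + 100 * polysyllables / words)

    # SMOG grade: polysyllable density, normalized to a 30-sentence sample
    smog = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291

    return {"FKRE": fkre, "FKGL": fkgl, "GFI": gfi, "SMOG": smog}

# Hypothetical example: a 300-word chatbot reply in 15 sentences,
# with 480 syllables and 45 polysyllabic words
print(readability_scores(words=300, sentences=15, syllables=480, polysyllables=45))
```

Commonly cited recommendations put patient education materials at around a sixth-grade reading level, so grade-level scores well above that are what the abstract means by "above the recommended level." PEMAT-P, by contrast, is scored as the proportion of applicable checklist items rated "agree," which is why its results read as fractions against a 0.70 acceptability cut-off.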

Source Journal
International Journal of Impotence Research (Medicine: Urology & Nephrology)
CiteScore: 4.90
Self-citation rate: 19.20%
Annual articles: 140
Review time: >12 weeks
Journal Description: International Journal of Impotence Research: The Journal of Sexual Medicine addresses sexual medicine for both genders as an interdisciplinary field. Its readership includes basic science researchers, urologists, endocrinologists, cardiologists, family practitioners, gynecologists, internists, neurologists, psychiatrists, psychologists, radiologists and other health care clinicians.