Chatbots' performance in premature ejaculation questions: a comparative analysis of reliability, readability, and understandability

S Gonultas, S Kardas, M Gelmis, A H Kinik, M Ozalevli, M G Kose, S Sulejman, S Yentur, B Arslan

International Journal of Impotence Research, published 2025-09-24. DOI: 10.1038/s41443-025-01179-3
This study aimed to evaluate the reliability, readability, and understandability of chatbot responses to frequently asked questions about premature ejaculation, and to assess the contributions, potential risks, and limitations of artificial intelligence. Fifteen questions were selected using data from Google Trends and posed to the chatbots Copilot, Gemini, ChatGPT4o, ChatGPT4oPlus, and DeepSeek-R1. Reliability was evaluated by two experts using the Global Quality Scale (GQS); readability was assessed with the Flesch-Kincaid Reading Ease (FKRE), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), and Simple Measure of Gobbledygook (SMOG); and understandability was evaluated using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). Additionally, the consistency of source citations was examined. The GQS scores were as follows: Copilot: 3.96 ± 0.66, Gemini: 3.66 ± 0.78, ChatGPT4o: 4.83 ± 0.23, ChatGPT4oPlus: 4.83 ± 0.29, DeepSeek-R1: 4.86 ± 0.22 (p < 0.001). The PEMAT-P scores were as follows: Copilot: 0.70 ± 0.05, Gemini: 0.72 ± 0.04, ChatGPT4o: 0.83 ± 0.03, ChatGPT4oPlus: 0.77 ± 0.06, DeepSeek-R1: 0.79 ± 0.06 (p < 0.001). Although ChatGPT4oPlus and DeepSeek-R1 scored higher for reliability and understandability, all chatbots performed at an acceptable level (≥70%). However, readability scores exceeded the level recommended for the target audience. Instances of low reliability or unverified sources were noted, with no significant differences between the chatbots. Chatbots provide highly reliable and informative responses regarding premature ejaculation; however, there are significant limitations that require improvement, particularly concerning readability and the reliability of sources.
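The four readability indices named in the abstract are standard surface-level formulas computed from word, sentence, and syllable counts. The sketch below shows how such scores are derived; it is not the study's actual tooling, the vowel-group syllable counter is a crude heuristic (dedicated tools use dictionary-based counts), and SMOG formally assumes a sample of at least 30 sentences.

```python
import math
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)

    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word

    return {
        # Flesch-Kincaid Reading Ease: higher = easier (60-70 is plain English)
        "FKRE": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid Grade Level: US school grade needed to follow the text
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
        # Gunning Fog Index: grade level from sentence length + % complex words
        "GFI": 0.4 * (wps + 100 * polysyllables / len(words)),
        # SMOG: grade level from polysyllable density, normalized to 30 sentences
        "SMOG": 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291,
    }

print(readability(
    "Premature ejaculation is a common and treatable condition. "
    "Behavioral techniques and medication can both help."
))
```

Lower FKGL, GFI, and SMOG grades and a higher FKRE score indicate easier text; patient-education guidance commonly recommends roughly a sixth-grade reading level, which is the benchmark the chatbot responses exceeded.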
Journal description:
International Journal of Impotence Research: The Journal of Sexual Medicine addresses sexual medicine for both genders as an interdisciplinary field. Its readership includes basic science researchers, urologists, endocrinologists, cardiologists, family practitioners, gynecologists, internists, neurologists, psychiatrists, psychologists, radiologists, and other health care clinicians.