Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots.

IF: 1.2 · Q3 · OBSTETRICS & GYNECOLOGY
Journal of Mid-life Health · Pub Date: 2025-01-01 · Epub Date: 2025-04-05 · DOI: 10.4103/jmh.jmh_182_24
Himel Mondal, Devendra Nath Tiu, Shaikat Mondal, Rajib Dutta, Avijit Naskar, Indrashis Podder
{"title":"评估中年健康问题回答的准确性和可读性:六个大型语言模型聊天机器人的比较分析。","authors":"Himel Mondal, Devendra Nath Tiu, Shaikat Mondal, Rajib Dutta, Avijit Naskar, Indrashis Podder","doi":"10.4103/jmh.jmh_182_24","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of large language model (LLM) chatbots in health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of their information persist. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health queries instead of consulting a doctor. In this context, we evaluated the accuracy and readability of responses from six LLM chatbots to midlife health questions for men and women.</p><p><strong>Methods: </strong>Twenty questions on midlife health were asked to six different LLM chatbots - ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and evaluated for accuracy, relevancy, fluency, and coherence by three independent expert physicians. An overall score was also calculated by taking the average of four criteria. In addition, readability was analyzed using the Flesch-Kincaid Grade Level, to determine how easily the information could be understood by the general population.</p><p><strong>Results: </strong>In terms of fluency, Perplexity scored the highest (4.3 ± 1.78), coherence was highest for Meta AI (4.26 ± 0.16), accuracy of responses was highest for Meta AI, and relevancy score was highest for Meta AI (4.35 ± 0.24). Overall, Meta AI scored the highest (4.28 ± 0.16), followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest score (3.72 ± 0.19) (<i>P</i> < 0.0001). 
Perplexity showed the highest score of 41.24 ± 10.57 in readability and lowest in grade level (11.11 ± 1.93), meaning its text is the easiest to read and requires a lower level of education.</p><p><strong>Conclusion: </strong>LLM chatbots can answer midlife-related health questions with variable capabilities. Meta AI was found to be highest scoring chatbot for addressing men's and women's midlife health questions, whereas Perplexity offers high readability for accessible information. Hence, LLM chatbots can be used as educational tools for midlife health by selecting appropriate chatbots according to its capability.</p>","PeriodicalId":37717,"journal":{"name":"Journal of Mid-life Health","volume":"16 1","pages":"45-50"},"PeriodicalIF":1.2000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12052287/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots.\",\"authors\":\"Himel Mondal, Devendra Nath Tiu, Shaikat Mondal, Rajib Dutta, Avijit Naskar, Indrashis Podder\",\"doi\":\"10.4103/jmh.jmh_182_24\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>The use of large language model (LLM) chatbots in health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of their information persist. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health queries instead of consulting a doctor. 
In this context, we evaluated the accuracy and readability of responses from six LLM chatbots to midlife health questions for men and women.</p><p><strong>Methods: </strong>Twenty questions on midlife health were asked to six different LLM chatbots - ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and evaluated for accuracy, relevancy, fluency, and coherence by three independent expert physicians. An overall score was also calculated by taking the average of four criteria. In addition, readability was analyzed using the Flesch-Kincaid Grade Level, to determine how easily the information could be understood by the general population.</p><p><strong>Results: </strong>In terms of fluency, Perplexity scored the highest (4.3 ± 1.78), coherence was highest for Meta AI (4.26 ± 0.16), accuracy of responses was highest for Meta AI, and relevancy score was highest for Meta AI (4.35 ± 0.24). Overall, Meta AI scored the highest (4.28 ± 0.16), followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest score (3.72 ± 0.19) (<i>P</i> < 0.0001). Perplexity showed the highest score of 41.24 ± 10.57 in readability and lowest in grade level (11.11 ± 1.93), meaning its text is the easiest to read and requires a lower level of education.</p><p><strong>Conclusion: </strong>LLM chatbots can answer midlife-related health questions with variable capabilities. Meta AI was found to be highest scoring chatbot for addressing men's and women's midlife health questions, whereas Perplexity offers high readability for accessible information. 
Hence, LLM chatbots can be used as educational tools for midlife health by selecting appropriate chatbots according to its capability.</p>\",\"PeriodicalId\":37717,\"journal\":{\"name\":\"Journal of Mid-life Health\",\"volume\":\"16 1\",\"pages\":\"45-50\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2025-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12052287/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Mid-life Health\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4103/jmh.jmh_182_24\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/4/5 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"OBSTETRICS & GYNECOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Mid-life Health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4103/jmh.jmh_182_24","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/5 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"OBSTETRICS & GYNECOLOGY","Score":null,"Total":0}
Citations: 0

Abstract


Background: The use of large language model (LLM) chatbots in health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of their information persist. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health queries instead of consulting a doctor. In this context, we evaluated the accuracy and readability of responses from six LLM chatbots to midlife health questions for men and women.

Methods: Twenty questions on midlife health were posed to six LLM chatbots: ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and rated for accuracy, relevancy, fluency, and coherence by three independent expert physicians, and an overall score was calculated as the average of the four criteria. In addition, readability was analyzed using the Flesch-Kincaid Grade Level to determine how easily the general population could understand the information.
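The readability metrics used in the study can be sketched as follows. This is an illustrative implementation, not the authors' code: the vowel-group syllable counter is a rough heuristic (published analyses typically use dictionary-based counters such as those in dedicated readability tools, so exact values will differ), and it assumes non-empty English text.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllable count as the number of consecutive-vowel groups.

    A crude heuristic: real readability tools use dictionary lookups or
    more careful rules (e.g. silent final 'e'), so this over/undercounts.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def _counts(text: str) -> tuple[int, int, int]:
    """Return (sentences, words, syllables) for a non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: US school grade needed to read the text."""
    n_sent, n_words, n_syll = _counts(text)
    return 0.39 * n_words / n_sent + 11.8 * n_syll / n_words - 15.59


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier-to-read text."""
    n_sent, n_words, n_syll = _counts(text)
    return 206.835 - 1.015 * n_words / n_sent - 84.6 * n_syll / n_words
```

Under these formulas, short sentences of short words yield a low grade level and a high reading-ease score, which is why a readability score of about 41 with a grade level of about 11 (as reported for Perplexity below) corresponds to fairly demanding text.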

Results: Perplexity scored highest on fluency (4.3 ± 1.78), whereas Meta AI scored highest on coherence (4.26 ± 0.16), accuracy, and relevancy (4.35 ± 0.24). Overall, Meta AI scored the highest (4.28 ± 0.16), followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest score (3.72 ± 0.19) (P < 0.0001). Perplexity had the highest readability score (41.24 ± 10.57) and the lowest grade level (11.11 ± 1.93), meaning its text was the easiest to read and required the least education to understand.

Conclusion: LLM chatbots can answer midlife-related health questions with variable capability. Meta AI was the highest-scoring chatbot for addressing men's and women's midlife health questions, whereas Perplexity offered the most readable, accessible text. Hence, LLM chatbots can serve as educational tools for midlife health, provided an appropriate chatbot is selected according to its capabilities.

Source journal: Journal of Mid-life Health (Social Sciences: Health)
CiteScore: 1.70 · Self-citation rate: 9.10% · Annual articles: 39 · Review time: 43 weeks
About the journal: The Journal of Mid-life Health is the official journal of the Indian Menopause Society, published quarterly in January, April, July, and October. It is a peer-reviewed scientific journal covering all aspects of midlife health, both preventive and curative, including gynecology, neurology, geriatrics, psychiatry, endocrinology, urology, andrology, psychology, healthy ageing, cardiovascular health, bone health, and quality of life as they relate to men and women in midlife. The journal provides a visible platform for researchers and clinicians to publish their experience in this area, promoting midlife health and healthy ageing, a growing need given increasing life expectancy. The editorial team has maintained high standards and consistently published original research papers, case reports, and review articles from leading national and international contributors, making the journal a valuable tool for menopause practitioners.