Evaluation of Three Large Language Models' Response Performances to Inquiries Regarding Post-Abortion Care in the Context of Chinese Language: A Comparative Analysis.

IF 2.0 · CAS Region 4 (Medicine) · JCR Q2 (Health Care Sciences & Services)
Risk Management and Healthcare Policy · Pub Date: 2025-08-18 · eCollection Date: 2025-01-01 · DOI: 10.2147/RMHP.S531777
Danyue Xue, Sha Liao
{"title":"Evaluation of Three Large Language Models' Response Performances to Inquiries Regarding Post-Abortion Care in the Context of Chinese Language: A Comparative Analysis.","authors":"Danyue Xue, Sha Liao","doi":"10.2147/RMHP.S531777","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>This study aimed to evaluate the response performances of three large language models (LLMs) (ChatGPT, Kimi, and Ernie Bot) to inquiries regarding post-abortion care (PAC) in the context of the Chinese language.</p><p><strong>Methods: </strong>The data was collected in October 2024. Twenty questions concerning the necessity of contraception after induced abortion, the best time for contraception, choice of a contraceptive method, contraceptive effectiveness, and the potential impact of contraception on fertility were used in this study. Each question was asked three times in Chinese for each LLM. Three PAC consultants conducted the evaluations. A Likert scale was used to score the responses based on accuracy, relevance, completeness, clarity, and reliability.</p><p><strong>Results: </strong>The number of responses received \"good\" (a mean score > 4), \"average\" (3 < mean score ≤ 4), and \"poor\" (a mean score ≤ 3) in overall evaluation was 159 (88.30%), 19 (10.57%), and 2 (1.10%). No statistically significant differences were identified in the overall evaluation among the three LLMs (<i>P</i> = 0.352). The number of the responses evaluated as good for accuracy, relevance, completeness, clarity, and reliability were 87 (48.33%), 154 (85.53%), 136 (75.57%), 133 (73.87%), and 128 (71.10%), respectively. No statistically significant differences were identified in accuracy, relevance, completeness or clarity between the three LLMs. A statistically significant difference was identified in reliability (<i>P</i> < 0.001).</p><p><strong>Conclusion: </strong>The three LLMs performed well overall and showed great potential for application in PAC consultations. The accuracy of the LLMs' responses should be improved through continuous training and evaluation.</p>","PeriodicalId":56009,"journal":{"name":"Risk Management and Healthcare Policy","volume":"18 ","pages":"2731-2741"},"PeriodicalIF":2.0000,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12372831/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Risk Management and Healthcare Policy","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2147/RMHP.S531777","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Background: This study aimed to evaluate the response performances of three large language models (LLMs) (ChatGPT, Kimi, and Ernie Bot) to inquiries regarding post-abortion care (PAC) in the context of the Chinese language.

Methods: Data were collected in October 2024. Twenty questions concerning the necessity of contraception after induced abortion, the best time to begin contraception, the choice of a contraceptive method, contraceptive effectiveness, and the potential impact of contraception on fertility were used in this study. Each question was asked three times in Chinese for each LLM. Three PAC consultants conducted the evaluations, using a Likert scale to score the responses on accuracy, relevance, completeness, clarity, and reliability.
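The evaluation workflow described above can be summarized in a short sketch: 20 questions, each asked three times of each of the three models, yields 180 responses, and each response receives a mean of the three consultants' Likert ratings across the five criteria. All names, data structures, and rating values in the following Python example are illustrative assumptions; the abstract does not provide the authors' scoring code.

```python
# Minimal sketch of the evaluation workflow (illustrative assumptions only,
# not the authors' actual code or data).

from statistics import mean

MODELS = ["ChatGPT", "Kimi", "Ernie Bot"]
CRITERIA = ["accuracy", "relevance", "completeness", "clarity", "reliability"]
N_QUESTIONS = 20
N_REPEATS = 3          # each question asked three times per model
N_RATERS = 3           # three PAC consultants

def mean_score(ratings):
    """Average the Likert scores (1-5) given by all raters on all criteria."""
    return mean(r[c] for r in ratings for c in CRITERIA)

# Example: one response rated by the three consultants (hypothetical values).
ratings_for_one_response = [
    {"accuracy": 4, "relevance": 5, "completeness": 4, "clarity": 5, "reliability": 4},
    {"accuracy": 5, "relevance": 5, "completeness": 4, "clarity": 4, "reliability": 4},
    {"accuracy": 4, "relevance": 5, "completeness": 5, "clarity": 5, "reliability": 4},
]
print(round(mean_score(ratings_for_one_response), 2))   # -> 4.47

# 20 questions x 3 repeats x 3 models = 180 responses in total.
print(N_QUESTIONS * N_REPEATS * len(MODELS))             # -> 180
```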

Results: The numbers of responses rated "good" (mean score > 4), "average" (3 < mean score ≤ 4), and "poor" (mean score ≤ 3) in the overall evaluation were 159 (88.30%), 19 (10.57%), and 2 (1.10%), respectively. No statistically significant differences were identified in the overall evaluation among the three LLMs (P = 0.352). The numbers of responses rated good for accuracy, relevance, completeness, clarity, and reliability were 87 (48.33%), 154 (85.53%), 136 (75.57%), 133 (73.87%), and 128 (71.10%), respectively. No statistically significant differences were identified in accuracy, relevance, completeness, or clarity among the three LLMs; a statistically significant difference was identified in reliability (P < 0.001).
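For readers who want to reproduce the bucketing and the across-model comparison, the following Python sketch shows one plausible approach. The category thresholds come from the abstract; the per-model breakdown of the 180 responses and the use of a chi-square test are assumptions made purely for illustration.

```python
# Sketch of the rating categories and an across-model comparison.
# The chi-square test and the per-model split are illustrative assumptions;
# the abstract reports only pooled counts and P values.

from scipy.stats import chi2_contingency

def category(mean_score: float) -> str:
    """Map a response's mean Likert score to the categories used in the study."""
    if mean_score > 4:
        return "good"
    if 3 < mean_score <= 4:
        return "average"
    return "poor"

print(category(4.47))  # -> good

# Hypothetical per-model breakdown of the 180 responses into
# good / average / poor (pooled totals in the abstract: 159 / 19 / 2).
table = [
    [54, 5, 1],   # ChatGPT   (illustrative counts only)
    [53, 7, 0],   # Kimi
    [52, 7, 1],   # Ernie Bot
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# A p value above 0.05 would be consistent with the reported finding of no
# significant difference in the overall evaluation (P = 0.352).
```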

Conclusion: The three LLMs performed well overall and showed great potential for application in PAC consultations. The accuracy of the LLMs' responses should be improved through continuous training and evaluation.

Source journal
Risk Management and Healthcare Policy
Category: Medicine - Public Health, Environmental and Occupational Health
CiteScore: 6.20
Self-citation rate: 2.90%
Articles published: 242
Review time: 16 weeks
Aims and scope: Risk Management and Healthcare Policy is an international, peer-reviewed, open access journal focusing on all aspects of public health, policy and preventative measures to promote good health and improve morbidity and mortality in the population. Specific topics covered in the journal include: public and community health; policy and law; preventative and predictive healthcare; risk and hazard management; epidemiology, detection and screening; lifestyle and diet modification; vaccination and disease transmission/modification programs; health and safety and occupational health; healthcare services provision; health literacy and education; advertising and promotion of health issues; and health economic evaluations and resource management. Risk Management and Healthcare Policy focuses on human interventional and observational research. The journal welcomes submitted papers covering original research, clinical and epidemiological studies, reviews and evaluations, guidelines, expert opinion and commentary, and extended reports. Case reports will only be considered if they make a valuable and original contribution to the literature. The journal does not accept study protocols, animal-based or cell line-based studies.