Artificial intelligence meets HIV education: Comparing three large language models on accuracy, readability, and reliability.

IF 1.3 | CAS Tier 4 (Medicine) | JCR Q4 (Immunology)
Özge Eren Korkmaz, Burcu Açıkalın Arıkan, Selda Sayın Kutlu, Figen Kaptan Aydoğmuş, Nurbanu Sezak
Journal: International Journal of STD & AIDS
DOI: 10.1177/09564624251372369
Published: 2025-09-04 (Journal Article)
Citations: 0

Abstract

Background: This study compares three large language models (LLMs) in answering common HIV questions, given ongoing concerns about their accuracy and reliability in patient education.

Methods: Models answered 63 HIV questions. Accuracy (5-point Likert scale), readability (Flesch-Kincaid, Gunning Fog, Coleman-Liau), and reliability (DISCERN, EQIP) were assessed.

Results: Claude 3.7 Sonnet showed significantly higher accuracy (4.54 ± 0.44) compared to ChatGPT-4o (4.29 ± 0.49) and Gemini Advanced 2.0 Flash (4.31 ± 0.50) (p < .001). ChatGPT-4o had lower accuracy in disease definition, follow-up, and transmission routes, while Gemini Advanced 2.0 Flash performed poorly in daily life and treatment-related questions. Readability analyses indicated ChatGPT-4o produced the most accessible content according to the Flesch-Kincaid and Coleman-Liau indices, whereas Claude 3.7 Sonnet was most comprehensible by Gunning Fog standards. Gemini Advanced 2.0 Flash consistently generated more complex texts across all readability measures (p < .001). Regarding reliability, Claude 3.7 Sonnet achieved "good" quality on DISCERN, while the others were rated "moderate" (p = .059). On EQIP, Claude 3.7 Sonnet (median 61.8) and ChatGPT-4o (55.3) were classified as "good quality with minor limitations," whereas Gemini Advanced 2.0 Flash (41.2) was rated "low quality" (p = .049).

Conclusions: Claude 3.7 Sonnet is preferable for accuracy and reliability, while ChatGPT-4o offers superior readability. Selecting LLMs for HIV education should consider accuracy, readability, and reliability, emphasizing regular assessment of content quality and cultural sensitivity.
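The three readability indices used in the study are standard surface-level formulas over word, sentence, syllable, and letter counts. The sketch below shows how they are typically computed; it is an illustration, not the authors' analysis pipeline, and the vowel-group syllable counter is a naive assumption (published tools usually rely on pronunciation dictionaries).

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per contiguous vowel group.
    # Real readability tools use pronunciation dictionaries instead.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    """Compute Flesch-Kincaid grade, Gunning Fog index, and Coleman-Liau index."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    n_w, n_s = len(words), len(sentences)
    letters = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog counts "complex" words: three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    fk = 0.39 * (n_w / n_s) + 11.8 * (syllables / n_w) - 15.59
    fog = 0.4 * ((n_w / n_s) + 100 * (complex_words / n_w))
    # Coleman-Liau: L = letters per 100 words, S = sentences per 100 words.
    cl = 0.0588 * (100 * letters / n_w) - 0.296 * (100 * n_s / n_w) - 15.8
    return {"flesch_kincaid": fk, "gunning_fog": fog, "coleman_liau": cl}

if __name__ == "__main__":
    sample = "HIV is treatable. Modern therapy lets patients live long lives."
    print(readability(sample))
```

All three indices approximate the US school grade level needed to understand the text, which is why lower scores correspond to more accessible patient-education content.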

Source journal metrics:
CiteScore: 2.60
Self-citation rate: 7.10%
Annual publications: 144
Review time: 3-6 weeks
Journal description: The International Journal of STD & AIDS provides a clinically oriented forum for investigating and treating sexually transmissible infections, HIV and AIDS. Publishing original research and practical papers, the journal contains in-depth review articles, short papers, case reports, audit reports, CPD papers and a lively correspondence column. This journal is a member of the Committee on Publication Ethics (COPE).