Artificial intelligence meets HIV education: Comparing three large language models on accuracy, readability, and reliability.

IF 1.3 | CAS Tier 4 (Medicine) | JCR Q4 (Immunology)
Özge Eren Korkmaz, Burcu Açıkalın Arıkan, Selda Sayın Kutlu, Figen Kaptan Aydoğmuş, Nurbanu Sezak
Journal: International Journal of STD & AIDS
DOI: 10.1177/09564624251372369
Published: 2025-09-04 (Journal Article)
Citations: 0

Abstract

Background: This study compares three large language models (LLMs) in answering common HIV questions, given ongoing concerns about their accuracy and reliability in patient education.

Methods: Models answered 63 HIV questions. Accuracy (5-point Likert scale), readability (Flesch-Kincaid, Gunning Fog, Coleman-Liau), and reliability (DISCERN, EQIP) were assessed.

Results: Claude 3.7 Sonnet showed significantly higher accuracy (4.54 ± 0.44) compared to ChatGPT-4o (4.29 ± 0.49) and Gemini Advanced 2.0 Flash (4.31 ± 0.50) (p < .001). ChatGPT-4o had lower accuracy in disease definition, follow-up, and transmission routes, while Gemini Advanced 2.0 Flash performed poorly in daily life and treatment-related questions. Readability analyses indicated ChatGPT-4o produced the most accessible content according to the Flesch-Kincaid and Coleman-Liau indices, whereas Claude 3.7 Sonnet was most comprehensible by Gunning Fog standards. Gemini Advanced 2.0 Flash consistently generated more complex texts across all readability measures (p < .001). Regarding reliability, Claude 3.7 Sonnet achieved "good" quality on DISCERN, while the others were rated "moderate" (p = .059). On EQIP, Claude 3.7 Sonnet (median 61.8) and ChatGPT-4o (55.3) were classified as "good quality with minor limitations," whereas Gemini Advanced 2.0 Flash (41.2) was rated "low quality" (p = .049).

Conclusions: Claude 3.7 Sonnet is preferable for accuracy and reliability, while ChatGPT-4o offers superior readability. Selecting LLMs for HIV education should consider accuracy, readability, and reliability, emphasizing regular assessment of content quality and cultural sensitivity.
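The three readability indices used in the study are standard surface-level formulas over word, sentence, syllable, and letter counts. The sketch below shows how they are typically computed; it is an illustration, not the authors' analysis pipeline, and the vowel-group syllable counter is a naive assumption (published tools usually rely on pronunciation dictionaries).

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per contiguous vowel group.
    # Real readability tools use pronunciation dictionaries instead.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    """Compute Flesch-Kincaid grade, Gunning Fog index, and Coleman-Liau index."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    n_w, n_s = len(words), len(sentences)
    letters = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog counts "complex" words: three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    fk = 0.39 * (n_w / n_s) + 11.8 * (syllables / n_w) - 15.59
    fog = 0.4 * ((n_w / n_s) + 100 * (complex_words / n_w))
    # Coleman-Liau: L = letters per 100 words, S = sentences per 100 words.
    cl = 0.0588 * (100 * letters / n_w) - 0.296 * (100 * n_s / n_w) - 15.8
    return {"flesch_kincaid": fk, "gunning_fog": fog, "coleman_liau": cl}

if __name__ == "__main__":
    sample = "HIV is treatable. Modern therapy lets patients live long lives."
    print(readability(sample))
```

All three indices approximate the US school grade level needed to understand the text, which is why lower scores correspond to more accessible patient-education content.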

Source journal metrics:
CiteScore: 2.60
Self-citation rate: 7.10%
Annual publications: 144
Review time: 3-6 weeks
Journal description: The International Journal of STD & AIDS provides a clinically oriented forum for investigating and treating sexually transmissible infections, HIV and AIDS. Publishing original research and practical papers, the journal contains in-depth review articles, short papers, case reports, audit reports, CPD papers and a lively correspondence column. This journal is a member of the Committee on Publication Ethics (COPE).