Özge Eren Korkmaz, Burcu Açıkalın Arıkan, Selda Sayın Kutlu, Figen Kaptan Aydoğmuş, Nurbanu Sezak
{"title":"人工智能与艾滋病教育:比较三种大型语言模型的准确性、可读性和可靠性。","authors":"Özge Eren Korkmaz, Burcu Açıkalın Arıkan, Selda Sayın Kutlu, Figen Kaptan Aydoğmuş, Nurbanu Sezak","doi":"10.1177/09564624251372369","DOIUrl":null,"url":null,"abstract":"<p><p>BackgroundThis study compares three large language models (LLMs) in answering common HIV questions, given ongoing concerns about their accuracy and reliability in patient education.MethodsModels answered 63 HIV questions. Accuracy (5-point Likert), readability (Flesch-Kincaid, Gunning Fog, Coleman-Liau), and reliability (DISCERN, EQIP) were assessed.ResultsClaude 3.7 Sonnet showed significantly higher accuracy (4.54 ± 0.44) compared to ChatGPT-4o (4.29 ± 0.49) and Gemini Advanced 2.0 Flash (4.31 ± 0.50) (<i>p</i> < .001). ChatGPT-4o had lower accuracy in disease definition, follow-up, and transmission routes, while Gemini Advanced 2.0 Flash performed poorly in daily life and treatment-related questions. Readability analyses indicated ChatGPT-4o produced the most accessible content according to Flesch-Kincaid and Coleman-Liau indices, whereas Claude 3.7 Sonnet was most comprehensible by Gunning Fog standards. Gemini Advanced 2.0 Flash consistently generated more complex texts across all readability measures (<i>p</i> < .001). Regarding reliability, Claude 3.7 Sonnet achieved \"good\" quality on DISCERN, while others were rated \"moderate\" (<i>p</i> = .059). On EQIP, Claude 3.7 Sonnet (median 61.8) and ChatGPT-4o (55.3) were classified as \"good quality with minor limitations,\" whereas Gemini Advanced 2.0 Flash (41.2) was rated \"low quality\" (<i>p</i> = .049).ConclusionsClaude 3.7 Sonnet is preferable for accuracy and reliability, while ChatGPT-4o offers superior readability. Selecting LLMs for HIV education should consider accuracy, readability, and reliability, emphasizing regular assessment of content quality and cultural sensitivity.</p>","PeriodicalId":14408,"journal":{"name":"International Journal of STD & AIDS","volume":" ","pages":"9564624251372369"},"PeriodicalIF":1.3000,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Artificial intelligence meets HIV education: Comparing three large language models on accuracy, readability, and reliability.\",\"authors\":\"Özge Eren Korkmaz, Burcu Açıkalın Arıkan, Selda Sayın Kutlu, Figen Kaptan Aydoğmuş, Nurbanu Sezak\",\"doi\":\"10.1177/09564624251372369\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>BackgroundThis study compares three large language models (LLMs) in answering common HIV questions, given ongoing concerns about their accuracy and reliability in patient education.MethodsModels answered 63 HIV questions. Accuracy (5-point Likert), readability (Flesch-Kincaid, Gunning Fog, Coleman-Liau), and reliability (DISCERN, EQIP) were assessed.ResultsClaude 3.7 Sonnet showed significantly higher accuracy (4.54 ± 0.44) compared to ChatGPT-4o (4.29 ± 0.49) and Gemini Advanced 2.0 Flash (4.31 ± 0.50) (<i>p</i> < .001). ChatGPT-4o had lower accuracy in disease definition, follow-up, and transmission routes, while Gemini Advanced 2.0 Flash performed poorly in daily life and treatment-related questions. Readability analyses indicated ChatGPT-4o produced the most accessible content according to Flesch-Kincaid and Coleman-Liau indices, whereas Claude 3.7 Sonnet was most comprehensible by Gunning Fog standards. 
Gemini Advanced 2.0 Flash consistently generated more complex texts across all readability measures (<i>p</i> < .001). Regarding reliability, Claude 3.7 Sonnet achieved \\\"good\\\" quality on DISCERN, while others were rated \\\"moderate\\\" (<i>p</i> = .059). On EQIP, Claude 3.7 Sonnet (median 61.8) and ChatGPT-4o (55.3) were classified as \\\"good quality with minor limitations,\\\" whereas Gemini Advanced 2.0 Flash (41.2) was rated \\\"low quality\\\" (<i>p</i> = .049).ConclusionsClaude 3.7 Sonnet is preferable for accuracy and reliability, while ChatGPT-4o offers superior readability. Selecting LLMs for HIV education should consider accuracy, readability, and reliability, emphasizing regular assessment of content quality and cultural sensitivity.</p>\",\"PeriodicalId\":14408,\"journal\":{\"name\":\"International Journal of STD & AIDS\",\"volume\":\" \",\"pages\":\"9564624251372369\"},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2025-09-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of STD & AIDS\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/09564624251372369\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"IMMUNOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of STD & AIDS","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/09564624251372369","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"IMMUNOLOGY","Score":null,"Total":0}
Artificial intelligence meets HIV education: Comparing three large language models on accuracy, readability, and reliability.
Background: This study compares three large language models (LLMs) in answering common HIV questions, given ongoing concerns about their accuracy and reliability in patient education.

Methods: The models answered 63 HIV questions. Accuracy (5-point Likert scale), readability (Flesch-Kincaid, Gunning Fog, Coleman-Liau), and reliability (DISCERN, EQIP) were assessed.

Results: Claude 3.7 Sonnet showed significantly higher accuracy (4.54 ± 0.44) than ChatGPT-4o (4.29 ± 0.49) and Gemini Advanced 2.0 Flash (4.31 ± 0.50) (p < .001). ChatGPT-4o had lower accuracy on questions about disease definition, follow-up, and transmission routes, while Gemini Advanced 2.0 Flash performed poorly on daily-life and treatment-related questions. Readability analyses indicated that ChatGPT-4o produced the most accessible content according to the Flesch-Kincaid and Coleman-Liau indices, whereas Claude 3.7 Sonnet was most comprehensible by Gunning Fog standards. Gemini Advanced 2.0 Flash consistently generated more complex texts across all readability measures (p < .001). Regarding reliability, Claude 3.7 Sonnet achieved "good" quality on DISCERN, while the other models were rated "moderate" (p = .059). On EQIP, Claude 3.7 Sonnet (median 61.8) and ChatGPT-4o (55.3) were classified as "good quality with minor limitations," whereas Gemini Advanced 2.0 Flash (41.2) was rated "low quality" (p = .049).

Conclusions: Claude 3.7 Sonnet is preferable for accuracy and reliability, while ChatGPT-4o offers superior readability. Selecting LLMs for HIV education should take accuracy, readability, and reliability into account, with regular assessment of content quality and cultural sensitivity.
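The readability indices cited in the abstract are standard surface-level formulas based on sentence length, word length, and syllable counts. As a minimal illustrative sketch (not the authors' evaluation pipeline), the Python snippet below computes approximate Flesch-Kincaid grade, Gunning Fog, and Coleman-Liau scores for a passage of English text; the regex-based syllable counter is a rough heuristic and an assumption of this sketch, not part of the study.

import re

def _count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (approximation only).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    # Compute approximate Flesch-Kincaid grade, Gunning Fog, and Coleman-Liau scores.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("Text must contain at least one sentence and one word.")

    n_sent, n_words = len(sentences), len(words)
    n_syll = sum(_count_syllables(w) for w in words)
    n_complex = sum(1 for w in words if _count_syllables(w) >= 3)  # "hard" words for Fog
    n_letters = sum(len(w) for w in words)

    fk = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
    fog = 0.4 * ((n_words / n_sent) + 100 * (n_complex / n_words))
    cl = 0.0588 * (100 * n_letters / n_words) - 0.296 * (100 * n_sent / n_words) - 15.8
    return {"flesch_kincaid_grade": fk, "gunning_fog": fog, "coleman_liau": cl}

if __name__ == "__main__":
    sample = ("HIV is transmitted through blood and sexual contact. "
              "Antiretroviral therapy controls the virus effectively.")
    print(readability(sample))

Higher grade-level scores indicate harder text, which is how the study ranks the models' outputs; published readability tools use more careful syllable and sentence detection than this sketch.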
About the journal:
The International Journal of STD & AIDS provides a clinically oriented forum for investigating and treating sexually transmissible infections, HIV and AIDS. Publishing original research and practical papers, the journal contains in-depth review articles, short papers, case reports, audit reports, CPD papers and a lively correspondence column. This journal is a member of the Committee on Publication Ethics (COPE).