Comparing Performance of Large Language Model-Based Tools on Patient-Driven Glaucoma Inquiries.

IF 1.8, Zone 4 (Medicine), JCR Q2 (Ophthalmology)

Journal of Glaucoma. Pub Date: 2025-09-10. DOI: 10.1097/IJG.0000000000002627

Dhruva Gupta, Sarah L Wagner, Alexandra G Castillejos Ellenthal, Andrew W Gross, Edward S Lu, Enchi K Chang, Arya S Rao, Marc D Succi


Citations: 0

Abstract


Comparing Performance of Large Language Model-Based Tools on Patient-Driven Glaucoma Inquiries.

Purpose: Large language models (LLMs) can assist patients who seek medical knowledge online to guide their own glaucoma care. Understanding the differences in LLM performance on glaucoma-related questions can inform patients about the best resources to obtain relevant information.

Methods: This cross-sectional study evaluated the accuracy, comprehensiveness, quality, and readability of LLM-generated responses to glaucoma inquiries. Seven questions posted by patients on the American Academy of Ophthalmology's Eye Care Forum were randomly selected and prompted into GPT-4o, GPT-4o Mini, Gemini Pro, and Gemini Flash in September 2024. Four physicians practicing ophthalmology assessed responses using a Likert scale based on accuracy, comprehensiveness, and quality. The Flesch-Kincaid Grade level measured readability while Bidirectional Encoder Representations from Transformers (BERT) Scores measured semantic similarity between LLM responses. Statistical analysis involved either the Kruskal-Wallis test with Dunn's post-hoc test or ANOVA analysis with Tukey's Honestly Significant Difference (HSD) test.
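The readability measure named in the Methods can be illustrated with a short sketch. The Flesch-Kincaid Grade Level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The abstract does not say which tool the authors used to compute it, so the function names, the regex tokenization, and especially the vowel-group syllable heuristic below are assumptions made for illustration, not the study's implementation.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not a dictionary lookup):
    count runs of consecutive vowels and drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level of an English text."""
    # Sentences are split on terminal punctuation; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    # Hypothetical response text, not taken from the study.
    sample = "Glaucoma damages the optic nerve. Eye drops can lower eye pressure."
    print(round(flesch_kincaid_grade(sample), 2))
```

Higher values indicate text that needs more years of schooling to read; a heuristic syllable counter like this one will differ slightly from dictionary-based tools.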

Results: GPT-4o rated higher in accuracy (P=0.016), comprehensiveness (P=0.007), and quality (P=0.002) compared to Gemini Pro. GPT-4o Mini rated higher in comprehensiveness (P=0.011) and quality (P=0.007). Gemini Flash and Gemini Pro were similar across all criteria. There were no differences in readability, and LLMs mostly produced semantically similar responses.
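For readers unfamiliar with the omnibus test behind comparisons like these, the following is a minimal sketch of the Kruskal-Wallis H statistic with tie correction, which suits ordinal Likert ratings. It is not the authors' analysis code: the function names are hypothetical, the ratings in the usage example are invented purely for illustration, and the p-value uses the chi-square approximation in a closed form that holds only for four groups (three degrees of freedom). The Dunn's post-hoc step is not shown.

```python
import math
from collections import Counter


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    # Assign each distinct value the mean of the rank positions it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over tied-value counts t.
    correction = 1 - sum(t**3 - t for t in Counter(pooled).values()) / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return max(h / correction, 0.0)  # guard against tiny negative rounding


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


if __name__ == "__main__":
    # Invented 1-5 Likert ratings for four unnamed tools; NOT data from the study.
    ratings = [
        [5, 4, 5, 4, 5, 4, 5],
        [4, 4, 5, 4, 4, 5, 4],
        [3, 3, 4, 2, 3, 4, 3],
        [3, 4, 3, 3, 2, 3, 4],
    ]
    h = kruskal_wallis_h(ratings)
    print(f"H = {h:.3f}, p = {chi2_sf_df3(h):.4f}")
```

A significant omnibus result only says that at least one group differs; pairwise P values such as those reported above come from the post-hoc comparison.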

Conclusions: GPT models surpass Gemini Pro in addressing commonly asked questions about glaucoma, providing valuable insights into the application of LLMs for providing health information.


Source journal

Journal of Glaucoma (Medicine: Ophthalmology)

CiteScore: 4.20

Self-citation rate: 10.00%

Articles published: 330

Review time: 4-8 weeks

About the journal: The Journal of Glaucoma is a peer-reviewed journal addressing the spectrum of issues affecting the definition, diagnosis, and management of glaucoma and providing a forum for lively and stimulating discussion of clinical, scientific, and socioeconomic factors affecting the care of glaucoma patients.