High accuracy but limited readability of large language model-generated responses to frequently asked questions about Kienböck's disease.

IF 2.2 | Zone 3, Medicine | JCR Q2, Orthopedics

BMC Musculoskeletal Disorders Pub Date : 2024-11-04 DOI:10.1186/s12891-024-07983-0

Zeynel Mert Asfuroğlu, Hilal Yağar, Ender Gümüşoğlu

BMC Musculoskeletal Disorders 25(1):879 (2024). Journal Article. PMC full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11536837/pdf/

Citations: 0

Abstract

Background: This study aimed to assess the quality and readability of large language model-generated responses to frequently asked questions (FAQs) about Kienböck's disease (KD).

Methods: Nineteen FAQs about KD were selected and divided into three categories: general knowledge, diagnosis, and treatment. The questions were entered into the Chat Generative Pre-trained Transformer 4 (ChatGPT-4) webpage using the zero-shot prompting method, and the responses were recorded. Hand surgeons with at least 5 years of experience and advanced English proficiency were contacted individually via WhatsApp instant messaging and asked to assess the responses. The quality of each response was rated by 33 experienced hand surgeons using the Global Quality Scale (GQS). Readability was assessed with the Flesch-Kincaid Grade Level (FKGL) and the Flesch Reading Ease Score (FRES).
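
The two readability metrics named in the Methods can be sketched as follows. This is a minimal illustration using the standard published Flesch formulas; the abstract does not say which tool the authors used, so the tokenization and the syllable heuristic below (`count_syllables`, `readability`) are assumptions for illustration only, and real readability tools may count syllables differently.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not the study's method):
    count vowel groups and drop a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and n > 1:
        n -= 1
    return max(n, 1)


def readability_from_counts(words: int, sentences: int, syllables: int):
    """Return (FRES, FKGL) from raw counts using the standard formulas."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl


def readability(text: str):
    """Return (FRES, FKGL) for a text, using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return readability_from_counts(len(words), len(sentences), syllables)


if __name__ == "__main__":
    fres, fkgl = readability("The cat sat. The dog ran.")
    print(f"FRES={fres:.2f}, FKGL={fkgl:.2f}")
```

A lower FRES and a higher FKGL both indicate harder text: longer sentences and more syllables per word push FRES down and FKGL up.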

Results: The mean GQS score was 4.28 out of a maximum of 5 points. Most ratings were good (270 of 627 ratings; 43.1%) or excellent (260 of 627 ratings; 41.5%). The mean FKGL was 15.5, and the mean FRES was 23.4, both of which are considered above the college graduate level. No statistically significant differences in quality or readability were found among responses to the general knowledge, diagnosis, and treatment questions.
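
The denominator of 627 in the Results is consistent with 33 raters each scoring 19 responses. The short check below reproduces the reported percentages from the counts given in the abstract; the variable and function names are illustrative, and the mean GQS of 4.28 cannot be recomputed because the abstract does not report the remaining score counts.

```python
RATERS = 33
QUESTIONS = 19
TOTAL_RATINGS = RATERS * QUESTIONS  # one rating per rater per response

GOOD = 270
EXCELLENT = 260


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    print(TOTAL_RATINGS)                         # total number of ratings
    print(pct(GOOD, TOTAL_RATINGS))              # share rated good
    print(pct(EXCELLENT, TOTAL_RATINGS))         # share rated excellent
    print(pct(GOOD + EXCELLENT, TOTAL_RATINGS))  # share rated good or excellent
```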

Conclusions: ChatGPT-4 provided high-quality responses to FAQs about KD. However, the primary drawback was the poor readability of these responses. By improving the readability of ChatGPT's output, we can transform it into a valuable information resource for individuals with KD.

Level of evidence: Level IV, Observational study.

Source journal

BMC Musculoskeletal Disorders (Medicine: Rheumatology)

CiteScore: 3.80
Self-citation rate: 8.70%
Articles published: 1017
Review time: 3-6 weeks

Journal description: BMC Musculoskeletal Disorders is an open access, peer-reviewed journal that considers articles on all aspects of the prevention, diagnosis and management of musculoskeletal disorders, as well as related molecular genetics, pathophysiology, and epidemiology. The scope of the journal covers research into rheumatic diseases where the primary focus relates specifically to a component(s) of the musculoskeletal system.