"Can We Trust Them?" An Expert Evaluation of Large Language Models to Provide Sleep and Jet Lag Recommendations for Athletes.
Jacopo Vitale, Alan McCall, Andrea Cina, , Dina C Janse van Rensburg, Shona Halson
Sports Medicine (published online 2025-10-03). DOI: 10.1007/s40279-025-02303-5 (https://doi.org/10.1007/s40279-025-02303-5). JCR: Q1, Sport Sciences; IF 9.4.
Citations: 0
Abstract
BACKGROUND
With the increasing use of artificial intelligence in healthcare and sports science, large language models (LLMs) are being explored as tools for delivering personalized, evidence-based guidance to athletes.
OBJECTIVE
This study evaluated the capabilities of LLMs (ChatGPT-3.5, ChatGPT-4, and Google Bard) to deliver evidence-based advice on sleep and jet lag for athletes.
METHODS
Conducted in two phases between January and June 2024, the study first identified ten frequently asked questions on these topics with input from experts and LLMs. In the second phase, 20 experts (mean age 43.9 ± 9.0 years; ten females, ten males) assessed LLM responses using Google Forms surveys administered at two time points (T1 and T2). Inter-rater reliability was evaluated using Fleiss' Kappa, intra-rater agreement using the Jaccard Similarity Index (JSI), and content validity through the content validity ratio (CVR). Differences among LLMs were analyzed using Friedman and Chi-square tests.
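For readers unfamiliar with the three agreement metrics named above, the following minimal Python sketch (illustrative only, with hypothetical numbers; it is not the authors' analysis code) shows how Fleiss' Kappa, the Jaccard Similarity Index, and Lawshe's content validity ratio are typically computed.

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    # counts[i, j] = number of raters assigning subject i to category j;
    # every row must sum to the same number of raters n.
    n_subjects = counts.shape[0]
    n = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_subjects * n)            # category proportions
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()              # observed vs. chance
    return (p_bar - p_e) / (1 - p_e)

def jaccard(a: set, b: set) -> float:
    # JSI: size of the overlap of two answer sets divided by their union.
    return len(a & b) / len(a | b)

def cvr(n_essential: int, n_raters: int) -> float:
    # Lawshe's content validity ratio: (n_e - N/2) / (N/2).
    return (n_essential - n_raters / 2) / (n_raters / 2)

# Hypothetical illustration: if 17 of 20 experts rate a response appropriate,
# CVR = (17 - 10) / 10 = 0.70, close to the 0.67 reported for ChatGPT-4 on sleep.
print(round(cvr(17, 20), 2))  # 0.7
print(jaccard({"q1", "q2", "q3"}, {"q1", "q2", "q4"}))  # 0.5

Under Lawshe's commonly cited critical-value table, a panel of about 20 raters needs a CVR of roughly 0.42 or higher for an item to be considered valid, which is consistent with the abstract treating 0.67-0.68 as valid and 0 as not.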
RESULTS
Experts' response rates were high (100% at T1 and 95% at T2). Inter-rater reliability was minimal (Fleiss' Kappa: 0.21-0.39), while intra-rater agreement was high, with 53% of experts achieving a JSI ≥ 0.75. ChatGPT-4 had the highest CVR for sleep (0.67) and was the only model with a valid CVR for jet lag (0.68). Google Bard showed the lowest CVR for jet lag (0), with significant differences compared to ChatGPT-3.5 (p = 0.0073) and ChatGPT-4 (p < 0.0001). Reasons for inappropriate responses varied significantly for jet lag (p < 0.0001), with Google Bard criticized for insufficient information and frequent errors. ChatGPT-4 outperformed other models overall.
CONCLUSIONS
This study highlights the potential of LLMs, particularly ChatGPT-4, to provide evidence-based advice on sleep but underscores the need for improved accuracy and validation for jet lag recommendations.
About the Journal:
Sports Medicine focuses on providing definitive and comprehensive review articles that interpret and evaluate current literature, aiming to offer insights into research findings in the sports medicine and exercise field. The journal covers major topics such as sports medicine and sports science, medical syndromes associated with sport and exercise, clinical medicine's role in injury prevention and treatment, exercise for rehabilitation and health, and the application of physiological and biomechanical principles to specific sports.
Types of Articles:
Review Articles: Definitive and comprehensive reviews that interpret and evaluate current literature to provide rationale for and application of research findings.
Leading/Current Opinion Articles: Overviews of contentious or emerging issues in the field.
Original Research Articles: High-quality research articles.
Enhanced Features: Additional features like slide sets, videos, and animations aimed at increasing the visibility, readership, and educational value of the journal's content.
Plain Language Summaries: Summaries accompanying articles to assist readers in understanding important medical advances.
Peer Review Process:
All manuscripts undergo peer review by international experts to ensure quality and rigor. The journal also welcomes Letters to the Editor, which will be considered for publication.