"Can We Trust Them?" An Expert Evaluation of Large Language Models to Provide Sleep and Jet Lag Recommendations for Athletes.
Jacopo Vitale, Alan McCall, Andrea Cina, , Dina C Janse van Rensburg, Shona Halson
Sports Medicine (published online 2025-10-03). DOI: 10.1007/s40279-025-02303-5 (https://doi.org/10.1007/s40279-025-02303-5). JCR: Q1, Sport Sciences; IF 9.4.
Citations: 0
Abstract
BACKGROUND
With the increasing use of artificial intelligence in healthcare and sports science, large language models (LLMs) are being explored as tools for delivering personalized, evidence-based guidance to athletes.
OBJECTIVE
This study evaluated the capabilities of LLMs (ChatGPT-3.5, ChatGPT-4, and Google Bard) to deliver evidence-based advice on sleep and jet lag for athletes.
METHODS
Conducted in two phases between January and June 2024, the study first identified ten frequently asked questions on these topics with input from experts and LLMs. In the second phase, 20 experts (mean age 43.9 ± 9.0 years; ten females, ten males) assessed LLM responses using Google Forms surveys administered at two time points (T1 and T2). Inter-rater reliability was evaluated using Fleiss' Kappa, intra-rater agreement using the Jaccard Similarity Index (JSI), and content validity through the content validity ratio (CVR). Differences among LLMs were analyzed using Friedman and Chi-square tests.
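For readers unfamiliar with the three agreement metrics named above, the following minimal Python sketch (illustrative only, with hypothetical numbers; it is not the authors' analysis code) shows how Fleiss' Kappa, the Jaccard Similarity Index, and Lawshe's content validity ratio are typically computed.

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    # counts[i, j] = number of raters assigning subject i to category j;
    # every row must sum to the same number of raters n.
    n_subjects = counts.shape[0]
    n = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_subjects * n)            # category proportions
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()              # observed vs. chance
    return (p_bar - p_e) / (1 - p_e)

def jaccard(a: set, b: set) -> float:
    # JSI: size of the overlap of two answer sets divided by their union.
    return len(a & b) / len(a | b)

def cvr(n_essential: int, n_raters: int) -> float:
    # Lawshe's content validity ratio: (n_e - N/2) / (N/2).
    return (n_essential - n_raters / 2) / (n_raters / 2)

# Hypothetical illustration: if 17 of 20 experts rate a response appropriate,
# CVR = (17 - 10) / 10 = 0.70, close to the 0.67 reported for ChatGPT-4 on sleep.
print(round(cvr(17, 20), 2))  # 0.7
print(jaccard({"q1", "q2", "q3"}, {"q1", "q2", "q4"}))  # 0.5

Under Lawshe's commonly cited critical-value table, a panel of about 20 raters needs a CVR of roughly 0.42 or higher for an item to be considered valid, which is consistent with the abstract treating 0.67-0.68 as valid and 0 as not.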
RESULTS
Experts' response rates were high (100% at T1 and 95% at T2). Inter-rater reliability was minimal (Fleiss' Kappa: 0.21-0.39), while intra-rater agreement was high, with 53% of experts achieving a JSI ≥ 0.75. ChatGPT-4 had the highest CVR for sleep (0.67) and was the only model with a valid CVR for jet lag (0.68). Google Bard showed the lowest CVR for jet lag (0), with significant differences compared to ChatGPT-3.5 (p = 0.0073) and ChatGPT-4 (p < 0.0001). Reasons for inappropriate responses varied significantly for jet lag (p < 0.0001), with Google Bard criticized for insufficient information and frequent errors. ChatGPT-4 outperformed other models overall.
CONCLUSIONS
This study highlights the potential of LLMs, particularly ChatGPT-4, to provide evidence-based advice on sleep but underscores the need for improved accuracy and validation for jet lag recommendations.
About the Journal:
Sports Medicine focuses on providing definitive and comprehensive review articles that interpret and evaluate current literature, aiming to offer insights into research findings in the sports medicine and exercise field. The journal covers major topics such as sports medicine and sports science, medical syndromes associated with sport and exercise, clinical medicine's role in injury prevention and treatment, exercise for rehabilitation and health, and the application of physiological and biomechanical principles to specific sports.
Types of Articles:
Review Articles: Definitive and comprehensive reviews that interpret and evaluate current literature to provide rationale for and application of research findings.
Leading/Current Opinion Articles: Overviews of contentious or emerging issues in the field.
Original Research Articles: High-quality research articles.
Enhanced Features: Additional features like slide sets, videos, and animations aimed at increasing the visibility, readership, and educational value of the journal's content.
Plain Language Summaries: Summaries accompanying articles to assist readers in understanding important medical advances.
Peer Review Process:
All manuscripts undergo peer review by international experts to ensure quality and rigor. The journal also welcomes Letters to the Editor, which will be considered for publication.