Evaluating Locally Run Large Language Models for Obstructive Sleep Apnea Diagnosis and Treatment: A Real-World Polysomnography Study
Christopher Seifen, Tilman Huppertz, Katharina Bahr-Hamm, Haralampos Gouveris, Johannes Pordzik, Jonas Eckrich, Christoph Matthias, Harry Smith, Tom Kelsey, Andrew Blaikie, Sebastian Kuhn, Christoph Raphael Buhr
Nature and Science of Sleep, vol. 17, pp. 1587-1599. Published 2025-07-08. DOI: 10.2147/NSS.S536823 (https://doi.org/10.2147/NSS.S536823). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12257169/pdf/
Citation count: 0
Abstract
Purpose: Sleep medicine is a highly resource-intensive field where large language models (LLMs) could offer a promising solution by supporting diagnostic processes. As web-based LLMs have obvious data protection constraints, locally run LLMs are essential for clinical implementation. This study is the first to investigate the performance of locally run LLMs in the interpretation of real-world polysomnographic (PSG) results.
Methods: From the clinical database of our sleep laboratory, we randomly selected N=30 patients (18 male, 12 female, mean age 50.5 ± 11.1 years, mean body mass index 29.7 ± 5.5 kg/m², mean apnea hypopnea index 30.9 ± 23.8) who underwent PSG due to clinical complaints typical of obstructive sleep apnea (OSA). The board-certified sleep physician's interpretations of diagnosis, suitable first-line therapy or alternative therapy were compared with those of three locally run LLMs (Gemma2, Llama3 and Mistral Nemo) to assess the level of concordance.
Results: Gemma2 showed the lowest concordance of 33% (10/30 patients) with the board-certified sleep physician regarding OSA severity, followed by Mistral Nemo at 47% (14/30 patients) and Llama3 at 50% (15/30 patients). For automatic positive airway pressure (aPAP) recommendations, Mistral Nemo showed the highest concordance at 90% (27/30 patients), followed by Gemma2 and Llama3 with 83% (25/30 patients) each.
Conclusion: Although locally run LLMs bypass data security constraints and show promising potential for clinical practice, their performance needs significant improvement prior to real-world implementation. At present, they require further refinement and fine-tuning before they can be used routinely to interpret real-world PSG results.
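The concordance figures in the Results are simple percent agreement: the number of patients for whom a model matched the physician, divided by the 30 patients studied. The sketch below reproduces that arithmetic from the counts reported in the abstract. The function and variable names are illustrative assumptions, not taken from the study, and the per-patient label comparison in `count_agreements` is a hypothetical helper, since the abstract reports only aggregate counts.

```python
from typing import Sequence

# Agreement counts out of 30 patients, as reported in the abstract.
N_PATIENTS = 30
OSA_SEVERITY_AGREEMENTS = {"Gemma2": 10, "Mistral Nemo": 14, "Llama3": 15}
APAP_AGREEMENTS = {"Mistral Nemo": 27, "Gemma2": 25, "Llama3": 25}


def concordance_percent(agreements: int, total: int) -> int:
    """Percent agreement, rounded to the nearest whole number."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= agreements <= total:
        raise ValueError("agreements must lie between 0 and total")
    return round(100 * agreements / total)


def count_agreements(physician: Sequence[str], model: Sequence[str]) -> int:
    """Hypothetical helper: count patients where both labels match exactly."""
    if len(physician) != len(model):
        raise ValueError("label sequences must have the same length")
    return sum(p == m for p, m in zip(physician, model))


if __name__ == "__main__":
    for task, counts in (("OSA severity", OSA_SEVERITY_AGREEMENTS),
                         ("aPAP recommendation", APAP_AGREEMENTS)):
        for model_name, agreed in counts.items():
            pct = concordance_percent(agreed, N_PATIENTS)
            print(f"{task}: {model_name} {pct}% ({agreed}/{N_PATIENTS})")
```

Percent agreement does not correct for agreement expected by chance, so it is best read as a descriptive summary of how often each model matched the physician.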
About the journal:
Nature and Science of Sleep is an international, peer-reviewed, open access journal covering all aspects of sleep science and sleep medicine, including the neurophysiology and functions of sleep, the genetics of sleep, sleep and society, biological rhythms, dreaming, sleep disorders and therapy, and strategies to optimize healthy sleep.
Specific topics covered in the journal include:
The functions of sleep in humans and other animals
Physiological and neurophysiological changes with sleep
The genetics of sleep and sleep differences
The neurotransmitters, receptors and pathways involved in controlling both sleep and wakefulness
Behavioral and pharmacological interventions aimed at improving sleep, and improving wakefulness
Sleep changes with development and with age
Sleep and reproduction (e.g., changes across the menstrual cycle, with pregnancy and menopause)
The science and nature of dreams
Sleep disorders
Impact of sleep and sleep disorders on health, daytime function and quality of life
Sleep problems secondary to clinical disorders
Interaction of society with sleep (e.g., consequences of shift work, occupational health, public health)
The microbiome and sleep
Chronotherapy
Impact of circadian rhythms on sleep, physiology, cognition and health
Mechanisms controlling circadian rhythms, centrally and peripherally
Impact of circadian rhythm disruptions (including night shift work, jet lag and social jet lag) on sleep, physiology, cognition and health
Behavioral and pharmacological interventions aimed at reducing adverse effects of circadian-related sleep disruption
Assessment of technologies and biomarkers for measuring sleep and/or circadian rhythms
Epigenetic markers of sleep or circadian disruption.