Sonish Sivarajkumar, Thomas Yu Chow Tam, Haneef Ahamed Mohammad, Samuel Viggiano, David Oniani, Shyam Visweswaran, Yanshan Wang
{"title":"利用自然语言处理技术从阿尔茨海默病患者的临床笔记中提取睡眠信息。","authors":"Sonish Sivarajkumar, Thomas Yu Chow Tam, Haneef Ahamed Mohammad, Samuel Viggiano, David Oniani, Shyam Visweswaran, Yanshan Wang","doi":"10.1093/jamia/ocae177","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>Alzheimer's disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way to acquire sleep information is time-consuming, inefficient, non-scalable, and limited to patients' subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, providing insight into the relationship between sleep and AD onset and progression.</p><p><strong>Materials and methods: </strong>A gold standard dataset is created from manual annotation of 570 randomly sampled clinical note documents from the adSLEEP, a corpus of 192 000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset.</p><p><strong>Results: </strong>The annotated dataset of 482 patients comprised a predominantly White (89.2%), older adult population with an average age of 84.7 years, where females represented 64.1%, and a vast majority were non-Hispanic or Latino (94.6%). Rule-based NLP algorithm achieved the best performance of F1 across all sleep-related concepts. In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV scores for daytime sleepiness (1.00) and sleep duration (1.00), while the machine learning models had the highest PPV for napping (0.95) and bad sleep quality (0.86), and LLAMA2 with finetuning had the highest PPV for night wakings (0.93) and sleep problem (0.89).</p><p><strong>Discussion: </strong>Although sleep information is infrequently documented in the clinical notes, the proposed rule-based NLP algorithm and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not achieve good results, which is due to the small size of sleep information in the training data.</p><p><strong>Conclusion: </strong>The results show that the rule-based NLP algorithm consistently achieved the best performance for all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"2217-2227"},"PeriodicalIF":4.6000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11413436/pdf/","citationCount":"0","resultStr":"{\"title\":\"Extraction of sleep information from clinical notes of Alzheimer's disease patients using natural language processing.\",\"authors\":\"Sonish Sivarajkumar, Thomas Yu Chow Tam, Haneef Ahamed Mohammad, Samuel Viggiano, David Oniani, Shyam Visweswaran, Yanshan Wang\",\"doi\":\"10.1093/jamia/ocae177\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objectives: </strong>Alzheimer's disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way to acquire sleep information is time-consuming, inefficient, non-scalable, and limited to patients' subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, providing insight into the relationship between sleep and AD onset and progression.</p><p><strong>Materials and methods: </strong>A gold standard dataset is created from manual annotation of 570 randomly sampled clinical note documents from the adSLEEP, a corpus of 192 000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset.</p><p><strong>Results: </strong>The annotated dataset of 482 patients comprised a predominantly White (89.2%), older adult population with an average age of 84.7 years, where females represented 64.1%, and a vast majority were non-Hispanic or Latino (94.6%). Rule-based NLP algorithm achieved the best performance of F1 across all sleep-related concepts. In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV scores for daytime sleepiness (1.00) and sleep duration (1.00), while the machine learning models had the highest PPV for napping (0.95) and bad sleep quality (0.86), and LLAMA2 with finetuning had the highest PPV for night wakings (0.93) and sleep problem (0.89).</p><p><strong>Discussion: </strong>Although sleep information is infrequently documented in the clinical notes, the proposed rule-based NLP algorithm and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not achieve good results, which is due to the small size of sleep information in the training data.</p><p><strong>Conclusion: </strong>The results show that the rule-based NLP algorithm consistently achieved the best performance for all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.</p>\",\"PeriodicalId\":50016,\"journal\":{\"name\":\"Journal of the American Medical Informatics Association\",\"volume\":\" \",\"pages\":\"2217-2227\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2024-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11413436/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the American Medical Informatics Association\",\"FirstCategoryId\":\"91\",\"ListUrlMain\":\"https://doi.org/10.1093/jamia/ocae177\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocae177","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
摘要
目标:阿尔茨海默病(AD)是美国最常见的痴呆症。睡眠是与生活方式相关的因素之一,已被证明对老年期最佳认知功能至关重要。然而,目前还缺乏对睡眠与老年痴呆症发病率之间关系的研究。开展此类研究的一个主要瓶颈是,获取睡眠信息的传统方法耗时长、效率低、不可扩展,而且仅限于患者的主观体验。我们的目标是从 AD 患者的临床记录中自动提取与睡眠相关的特定模式,如打鼾、打盹、睡眠质量差、白天嗜睡、夜醒、其他睡眠问题和睡眠持续时间。据推测,这些睡眠模式与 AD 的发病率有关,有助于深入了解睡眠与 AD 的发病和发展之间的关系:我们从匹兹堡大学医学中心(UPMC)的adSLEEP语料库中随机抽取了570份临床笔记文档进行人工标注,创建了一个金标准数据集,该语料库包含7266名AD患者的19.2万份去标识化临床笔记。我们开发了一种基于规则的自然语言处理(NLP)算法、机器学习模型和基于大型语言模型(LLM)的NLP算法,以便从金标准数据集中自动提取与睡眠相关的概念,包括打鼾、午睡、睡眠问题、睡眠质量差、白天嗜睡、夜醒和睡眠持续时间:注释数据集包含 482 名患者,主要为白人(89.2%)、平均年龄为 84.7 岁的老年人,其中女性占 64.1%,绝大多数为非西班牙裔或拉丁裔(94.6%)。在所有与睡眠相关的概念中,基于规则的 NLP 算法取得了最佳的 F1 性能。在阳性预测值(PPV)方面,基于规则的NLP算法在白天嗜睡(1.00)和睡眠时间(1.00)方面的PPV得分最高,而机器学习模型在小睡(0.95)和睡眠质量差(0.86)方面的PPV最高,微调后的LLAMA2在夜醒(0.93)和睡眠问题(0.89)方面的PPV最高:讨论:尽管临床笔记中很少记录睡眠信息,但所提出的基于规则的 NLP 算法和基于 LLM 的 NLP 算法仍然取得了可喜的成果。相比之下,基于机器学习的方法没有取得很好的结果,这是因为训练数据中睡眠信息的规模较小:结果表明,基于规则的 NLP 算法在所有睡眠概念上都取得了最佳性能。这项研究的重点是注意力缺失症患者的临床笔记,但也可扩展到其他疾病的一般睡眠信息提取。
Extraction of sleep information from clinical notes of Alzheimer's disease patients using natural language processing.
Objectives: Alzheimer's disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way to acquire sleep information is time-consuming, inefficient, non-scalable, and limited to patients' subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, providing insight into the relationship between sleep and AD onset and progression.
Materials and methods: A gold standard dataset is created from manual annotation of 570 randomly sampled clinical note documents from the adSLEEP, a corpus of 192 000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset.
Results: The annotated dataset of 482 patients comprised a predominantly White (89.2%), older adult population with an average age of 84.7 years, where females represented 64.1%, and a vast majority were non-Hispanic or Latino (94.6%). Rule-based NLP algorithm achieved the best performance of F1 across all sleep-related concepts. In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV scores for daytime sleepiness (1.00) and sleep duration (1.00), while the machine learning models had the highest PPV for napping (0.95) and bad sleep quality (0.86), and LLAMA2 with finetuning had the highest PPV for night wakings (0.93) and sleep problem (0.89).
Discussion: Although sleep information is infrequently documented in the clinical notes, the proposed rule-based NLP algorithm and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not achieve good results, which is due to the small size of sleep information in the training data.
Conclusion: The results show that the rule-based NLP algorithm consistently achieved the best performance for all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.
期刊介绍:
JAMIA is AMIA''s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA''s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.