Unsupervised Deep Learning of Electronic Health Records to Characterize Heterogeneity Across Alzheimer Disease and Related Dementias: Cross-Sectional Study.
Matthew West, You Cheng, Yingnan He, Yu Leng, Colin Magdamo, Bradley T Hyman, John R Dickson, Alberto Serrano-Pozo, Deborah Blacker, Sudeshna Das
{"title":"Unsupervised Deep Learning of Electronic Health Records to Characterize Heterogeneity Across Alzheimer Disease and Related Dementias: Cross-Sectional Study.","authors":"Matthew West, You Cheng, Yingnan He, Yu Leng, Colin Magdamo, Bradley T Hyman, John R Dickson, Alberto Serrano-Pozo, Deborah Blacker, Sudeshna Das","doi":"10.2196/65178","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Alzheimer disease and related dementias (ADRD) exhibit prominent heterogeneity. Identifying clinically meaningful ADRD subtypes is essential for tailoring treatments to specific patient phenotypes.</p><p><strong>Objective: </strong>We aimed to use unsupervised learning techniques on electronic health records (EHRs) from memory clinic patients to identify ADRD subtypes.</p><p><strong>Methods: </strong>We used pretrained embeddings of non-ADRD diagnosis codes (International Classification of Diseases, Ninth Revision) and large language model (LLM)-derived embeddings of clinical notes from patient EHRs. Hierarchical clustering of these embeddings was used to identify ADRD subtypes. Clusters were characterized regarding their demographic and clinical features.</p><p><strong>Results: </strong>We analyzed a cohort of 3454 patients with ADRD from a memory clinic at Massachusetts General Hospital, each with a specialist diagnosis. Clustering pretrained embeddings of the non-ADRD diagnosis codes in patient EHRs revealed the following 3 patient subtypes: one with skin conditions, another with psychiatric disorders and an earlier age of onset, and a third with diabetes complications. Similarly, using LLM-derived embeddings of clinical notes, we identified 3 subtypes of patients as follows: one with psychiatric manifestations and higher prevalence of female participants (prevalence ratio: 1.59), another with cardiovascular and motor problems and higher prevalence of male participants (prevalence ratio: 1.75), and a third one with geriatric health disorders. Notably, we observed significant overlap between clusters from both data modalities (χ<sup>2</sup><sub>4</sub>=89.4; P<.001).</p><p><strong>Conclusions: </strong>By integrating International Classification of Diseases, Ninth Revision codes and LLM-derived embeddings, our analysis delineated 2 distinct ADRD subtypes with sex-specific comorbid and clinical presentations, offering insights for potential precision medicine approaches.</p>","PeriodicalId":36245,"journal":{"name":"JMIR Aging","volume":"8 ","pages":"e65178"},"PeriodicalIF":5.0000,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Aging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/65178","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"GERIATRICS & GERONTOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Alzheimer disease and related dementias (ADRD) exhibit prominent heterogeneity. Identifying clinically meaningful ADRD subtypes is essential for tailoring treatments to specific patient phenotypes.
Objective: We aimed to use unsupervised learning techniques on electronic health records (EHRs) from memory clinic patients to identify ADRD subtypes.
Methods: We used pretrained embeddings of non-ADRD diagnosis codes (International Classification of Diseases, Ninth Revision) and large language model (LLM)-derived embeddings of clinical notes from patient EHRs. Hierarchical clustering of these embeddings was used to identify ADRD subtypes. Clusters were characterized regarding their demographic and clinical features.
Results: We analyzed a cohort of 3454 patients with ADRD from a memory clinic at Massachusetts General Hospital, each with a specialist diagnosis. Clustering pretrained embeddings of the non-ADRD diagnosis codes in patient EHRs revealed the following 3 patient subtypes: one with skin conditions, another with psychiatric disorders and an earlier age of onset, and a third with diabetes complications. Similarly, using LLM-derived embeddings of clinical notes, we identified 3 subtypes of patients as follows: one with psychiatric manifestations and higher prevalence of female participants (prevalence ratio: 1.59), another with cardiovascular and motor problems and higher prevalence of male participants (prevalence ratio: 1.75), and a third one with geriatric health disorders. Notably, we observed significant overlap between clusters from both data modalities (χ24=89.4; P<.001).
Conclusions: By integrating International Classification of Diseases, Ninth Revision codes and LLM-derived embeddings, our analysis delineated 2 distinct ADRD subtypes with sex-specific comorbid and clinical presentations, offering insights for potential precision medicine approaches.