William Baskett, Benjamin Black, Adnan I. Qureshi, Chi-Ren Shyu
Title: Identifying homogenous patient subgroups using transformer based hierarchical clustering of heterogeneous Mixed-Modality medical data
Journal: Journal of Biomedical Informatics, volume 169, Article 104878
Published: 2025-07-13
DOI: 10.1016/j.jbi.2025.104878
URL: https://www.sciencedirect.com/science/article/pii/S1532046425001078
Impact factor: 4.5 (JCR Q2, Computer Science, Interdisciplinary Applications)
Citations: 0
Abstract
Objective
Patients are highly heterogeneous, with varying needs and responses to treatment. Identifying clinically homogenous patient subgroups is critical to improving personalized care. Patient records are often heterogeneous, may include multiple modalities that conventionally require separate data processing, and are often incomplete, which makes it difficult to identify meaningful clusters of patients.
Methods
We introduce Med-ROAR, a transformer-based Random Order AutoRegressive (ROAR) embedding model for medical data. Med-ROAR hierarchically clusters data by encoding it into hierarchical discrete embeddings, using a modified self-attention operation to facilitate random-order, mixed-modality autoregressive modeling. This allows the model to accept arbitrary mixes of record types without special considerations. We compare our method’s clustering effectiveness to standard agglomerative clustering using 147,469 individuals diagnosed with Autism Spectrum Disorder (ASD). We also evaluate its use on data with mixed modalities, and its resilience to missing information, using 50,458 clinical records from Intensive Care Unit (ICU) patients that include both tabular and time-series components.
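The random-order idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the tokenization or the modified self-attention, so the `(field, position, value)` token layout, the helper names `to_tokens` and `random_order_pairs`, and the example record are all assumptions made for illustration. The sketch only shows why a randomly ordered autoregressive objective can take tabular fields, time-series steps, and missing values in one sequence.

```python
import random


def to_tokens(record):
    """Flatten a mixed-modality record into (field, position, value) tokens.

    Hypothetical layout: a scalar field becomes one token, a list (time
    series) becomes one token per step, and a field set to None is skipped,
    so missing data simply shortens the sequence.
    """
    tokens = []
    for field, value in record.items():
        if value is None:
            continue  # missing field: no token, no imputation
        if isinstance(value, (list, tuple)):
            tokens.extend((field, t, v) for t, v in enumerate(value))
        else:
            tokens.append((field, 0, value))
    return tokens


def random_order_pairs(tokens, rng):
    """Shuffle the tokens, then build autoregressive (context, target) pairs.

    Each target is predicted from the tokens that precede it in the shuffled
    order, which is what a causal attention mask would enforce in a model.
    """
    order = tokens[:]
    rng.shuffle(order)
    return [(order[:i], order[i]) for i in range(len(order))]


# Hypothetical ICU-style record: two tabular fields, one missing lab value,
# and a short heart-rate time series.
record = {"age": 67, "sex": "F", "lactate": None, "heart_rate": [88, 92, 97]}
tokens = to_tokens(record)
pairs = random_order_pairs(tokens, random.Random(0))

for context, target in pairs:
    print(len(context), "context tokens ->", target)
```

Because each token carries its own field name and position, the order in which tokens arrive carries no meaning, so any subset of fields forms a valid input. That is one plausible reading of how the model can "accept arbitrary mixes of record types"; the actual mechanism in Med-ROAR may differ.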
Results
We demonstrate that Med-ROAR is more likely to discover more cohesive high-level clusters than distance-based methods such as agglomerative clustering. Our exploratory analysis of the autism data identifies clinically meaningful patterns of phenotypes within ASD. We identify homogenous, but atypical, patient subgroups within the ASD population. We also demonstrate Med-ROAR’s effectiveness in clustering patients using mixes of both tabular and time-series clinical records from ICU patients. We demonstrate that Med-ROAR can predict patient subgroups even using incomplete, preliminary information collected shortly after admission.
Conclusion
Med-ROAR is a flexible hierarchical clustering technique that learns to cluster patients by high-level semantic similarity rather than by rule-based metrics. It can accept whatever patient data may be available without modification to the underlying model architecture. The data modalities that Med-ROAR can accept are constrained primarily by computational resources rather than by architectural limitations.
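To make the link between hierarchical discrete embeddings and hierarchical clusters concrete, here is a small sketch under stated assumptions. It assumes each patient is assigned a tuple of discrete codes ordered from coarse to fine, so that patients sharing a code prefix fall into the same cluster at that depth. The abstract does not describe the code format, and the patient IDs, code values, and the helper `clusters_at_level` are hypothetical.

```python
from collections import defaultdict


def clusters_at_level(codes, level):
    """Group patient IDs by the first `level` entries of their discrete code.

    Assumes codes are ordered coarse to fine, so a shorter prefix gives a
    higher-level (broader) cluster.
    """
    groups = defaultdict(list)
    for patient_id, code in codes.items():
        groups[code[:level]].append(patient_id)
    return dict(groups)


# Hypothetical three-level codes for four patients.
codes = {
    "p1": (0, 1, 2),
    "p2": (0, 1, 0),
    "p3": (0, 3, 1),
    "p4": (2, 0, 0),
}

top = clusters_at_level(codes, 1)  # broadest clusters
mid = clusters_at_level(codes, 2)  # each top-level cluster split further

print(top)
print(mid)
```

In this toy example, cutting the hierarchy at depth 1 places p1, p2, and p3 together, while depth 2 separates p3 from p1 and p2. Choosing a depth plays the role that choosing a cut height plays in a dendrogram from agglomerative clustering.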
About the journal:
The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.