Unsupervised clustering for sepsis identification in large-scale patient data: a model development and validation study.

Impact factor: 2.8 · JCR quartile: Q2 · Category: Critical Care Medicine

Intensive Care Medicine Experimental · Pub Date: 2025-03-20 · DOI: 10.1186/s40635-025-00744-w

Na Li, Kiarash Riazi, Jie Pan, Kednapa Thavorn, Jennifer Ziegler, Bram Rochwerg, Hude Quan, Hallie C Prescott, Peter M Dodek, Bing Li, Alain Gervais, Allan Garland

Volume 13, issue 1, article 37. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11925832/pdf/


Abstract

Background: Sepsis is a major global health problem. However, it lacks a true reference standard for case identification, which complicates epidemiologic surveillance. Consensus definitions have changed multiple times, clinicians struggle to identify sepsis at the bedside, and differing identification algorithms generate wide variation in incidence rates. The two current identification approaches use either codes from administrative data or electronic health record (EHR)-based algorithms, such as the Centers for Disease Control and Prevention (CDC) Adult Sepsis Event (ASE); both have limitations. Our primary purpose here is to report initial steps in developing a novel approach to identifying sepsis using unsupervised clustering methods. Secondarily, we report a preliminary analysis of the resulting clusters, using identification by ASE criteria as a familiar comparator.

Methods: This retrospective cohort study used hospital administrative and EHR data on adults admitted to intensive care units (ICUs) at five Canadian medical centres (2015-2017), with split development and validation cohorts. After preprocessing 592 variables (demographics, encounter characteristics, diagnoses, medications, laboratory tests, and clinical management) and applying data reduction, we presented 55 principal components to eight different clustering algorithms. An automated elbow method determined the optimal number of clusters, and the optimal algorithm was selected based on clustering metrics for consistency, separation, distribution and stability. Cluster membership in the validation cohort was assigned using an XGBoost model trained to predict cluster membership in the development cohort. For cluster analysis, we prospectively subdivided clusters by their fractions meeting ASE criteria (≥ 50% ASE-majority clusters vs. ASE-minority clusters), and compared their characteristics.
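The abstract does not say which automated elbow rule was used. As a minimal sketch of one common heuristic, the function below picks the number of clusters whose point on the score curve lies farthest from the straight line joining the curve's two endpoints. The function name `elbow_k`, the choice of score (for example, within-cluster sum of squares), and the toy numbers are all assumptions for illustration, not the authors' implementation.

```python
from math import hypot


def elbow_k(ks, scores):
    """Return the k whose (k, score) point is farthest from the chord
    joining the first and last points of the curve.

    Hypothetical helper: the study's actual elbow rule is not specified.
    `ks` must be strictly increasing; `scores` is e.g. within-cluster SSE.
    """
    if len(ks) != len(scores) or len(ks) < 3:
        raise ValueError("need at least three (k, score) pairs")

    # Rescale both axes to [0, 1] so the distance does not depend on units.
    k_span = ks[-1] - ks[0]
    s_lo, s_hi = min(scores), max(scores)
    s_span = (s_hi - s_lo) or 1.0  # guard against a perfectly flat curve
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(s - s_lo) / s_span for s in scores]

    # Direction of the chord from the first to the last point.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = hypot(dx, dy)

    def distance_to_chord(x, y):
        # Perpendicular distance from (x, y) to the chord.
        return abs(dy * (x - xs[0]) - dx * (y - ys[0])) / norm

    best = max(range(len(ks)), key=lambda i: distance_to_chord(xs[i], ys[i]))
    return ks[best]


# Toy curve (made-up numbers): the score drops steeply until k = 3, then flattens.
toy_ks = [1, 2, 3, 4, 5, 6]
toy_scores = [100, 60, 30, 25, 22, 20]
print(elbow_k(toy_ks, toy_scores))
```

In a pipeline like the one described, such a rule would be applied to each candidate algorithm's score curve over the principal components; rescaling the axes keeps the choice from depending on the magnitude of the score.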

Results: There were 3660 patients in the development cohort and 3012 in the validation cohort, of whom 21.5% (development) and 19.1% (validation) were ASE (+). The Robust and Sparse K-means Clustering (RSKC) method performed best. In the development cohort, it identified 48 clusters of hospitalizations; 11 ASE-majority clusters contained 22.4% of all patients but 77.8% of all ASE (+) patients. Of the 209 ASE (-) patients in the ASE-majority clusters, 34.9% met more liberal ASE criteria for sepsis. Findings were consistent in the validation cohort.
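The ASE-majority split from the Methods (clusters in which at least 50% of patients meet ASE criteria) and the two shares reported above can be sketched as follows. This is an illustrative sketch on made-up toy data; the function names and the input layout (one cluster label and one ASE flag per patient) are assumptions, not the authors' code.

```python
from collections import defaultdict


def split_by_ase_fraction(cluster_ids, ase_flags, threshold=0.5):
    """Partition cluster labels into ASE-majority (fraction >= threshold)
    and ASE-minority sets. Hypothetical helper for illustration."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for cid, flag in zip(cluster_ids, ase_flags):
        totals[cid] += 1
        if flag:
            positives[cid] += 1
    majority = {c for c in totals if positives[c] / totals[c] >= threshold}
    minority = set(totals) - majority
    return majority, minority


def capture_summary(cluster_ids, ase_flags, majority):
    """Share of all patients, and of all ASE (+) patients, that fall
    inside the ASE-majority clusters."""
    flags_in_majority = [f for c, f in zip(cluster_ids, ase_flags) if c in majority]
    total_positive = sum(1 for f in ase_flags if f)
    return {
        "share_of_patients": len(flags_in_majority) / len(cluster_ids),
        "share_of_ase_positive": sum(1 for f in flags_in_majority if f) / total_positive,
    }


# Toy data (made up): cluster A is 3/4 ASE (+), B is 1/6, C is 1/2.
toy_clusters = ["A"] * 4 + ["B"] * 6 + ["C"] * 2
toy_flags = [1, 1, 1, 0] + [1, 0, 0, 0, 0, 0] + [1, 0]

majority, minority = split_by_ase_fraction(toy_clusters, toy_flags)
print(sorted(majority), sorted(minority))
print(capture_summary(toy_clusters, toy_flags, majority))
```

As a rough consistency check using the rounded percentages above: 22.4% of 3660 is about 820 patients, and 77.8% of the roughly 787 ASE (+) patients is about 612, which leaves about 208 ASE (-) patients in the ASE-majority clusters, close to the 209 reported.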

Conclusions: Unsupervised clustering applied to diverse, large-scale medical data offers a promising approach to the identification of sepsis phenotypes for epidemiological surveillance.


