Autoencoder-Based Representation Learning for Similar Patients Retrieval From Electronic Health Records: Comparative Study.

IF 3.8 3区医学 Q2 MEDICAL INFORMATICS

JMIR Medical Informatics Pub Date : 2025-07-24 DOI:10.2196/68830

Deyi Li, Aditi Shukla, Sravani Chandaka, Bradley Taylor, Jie Xu, Mei Liu

{"title":"Autoencoder-Based Representation Learning for Similar Patients Retrieval From Electronic Health Records: Comparative Study.","authors":"Deyi Li, Aditi Shukla, Sravani Chandaka, Bradley Taylor, Jie Xu, Mei Liu","doi":"10.2196/68830","DOIUrl":null,"url":null,"abstract":"Background: By analyzing electronic health record snapshots of similar patients, physicians can proactively predict disease onsets, customize treatment plans, and anticipate patient-specific trajectories. However, the modeling of electronic health record data is inherently challenging due to its high dimensionality, mixed feature types, noise, bias, and sparsity. Patient representation learning using autoencoders (AEs) presents promising opportunities to address these challenges. A critical question remains: how do different AE designs and distance measures impact the quality of retrieved similar patient cohorts?Objective: This study aims to evaluate the performance of 5 common AE variants-vanilla autoencoder, denoising autoencoder, contractive autoencoder, sparse autoencoder, and robust autoencoder-in retrieving similar patients. Additionally, it investigates the impact of different distance measures and hyperparameter configurations on model performance.Methods: We tested the 5 AE variants on 2 real-world datasets-the University of Kansas Medical Center (n=13,752) and the Medical College of Wisconsin (n=9568)-across 168 different hyperparameter configurations. To retrieve similar patients based on the AE-produced latent representations, we applied k-nearest neighbors (k-NN) using Euclidean and Mahalanobis distances. Two prediction targets were evaluated: acute kidney injury onset and postdischarge 1-year mortality.Results: Our findings demonstrate that (1) denoising autoencoders outperformed other AE variants when paired with Euclidean distance (P<.001), followed by vanilla autoencoders and contractive autoencoders; (2) learning rates significantly influenced the performance of AE variants; and (3) Mahalanobis distance-based k-NN frequently outperformed Euclidean distance-based k-NN when applied to latent representations. However, whether AE models are superior in transforming raw data into latent representations, compared with applying Mahalanobis distance-based k-NN directly to raw data, appears to be data-dependent.Conclusions: This study provides a comprehensive analysis of the performance of different AE variants in retrieving similar patients and evaluates the impact of various hyperparameter configurations on model performance. The findings lay the groundwork for future development of AE-based patient similarity estimation and personalized medicine.","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e68830"},"PeriodicalIF":3.8000,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12289314/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/68830","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}

引用次数: 0

Abstract

Background: By analyzing electronic health record snapshots of similar patients, physicians can proactively predict disease onsets, customize treatment plans, and anticipate patient-specific trajectories. However, the modeling of electronic health record data is inherently challenging due to its high dimensionality, mixed feature types, noise, bias, and sparsity. Patient representation learning using autoencoders (AEs) presents promising opportunities to address these challenges. A critical question remains: how do different AE designs and distance measures impact the quality of retrieved similar patient cohorts?

Objective: This study aims to evaluate the performance of 5 common AE variants-vanilla autoencoder, denoising autoencoder, contractive autoencoder, sparse autoencoder, and robust autoencoder-in retrieving similar patients. Additionally, it investigates the impact of different distance measures and hyperparameter configurations on model performance.

Methods: We tested the 5 AE variants on 2 real-world datasets-the University of Kansas Medical Center (n=13,752) and the Medical College of Wisconsin (n=9568)-across 168 different hyperparameter configurations. To retrieve similar patients based on the AE-produced latent representations, we applied k-nearest neighbors (k-NN) using Euclidean and Mahalanobis distances. Two prediction targets were evaluated: acute kidney injury onset and postdischarge 1-year mortality.

Results: Our findings demonstrate that (1) denoising autoencoders outperformed other AE variants when paired with Euclidean distance (P<.001), followed by vanilla autoencoders and contractive autoencoders; (2) learning rates significantly influenced the performance of AE variants; and (3) Mahalanobis distance-based k-NN frequently outperformed Euclidean distance-based k-NN when applied to latent representations. However, whether AE models are superior in transforming raw data into latent representations, compared with applying Mahalanobis distance-based k-NN directly to raw data, appears to be data-dependent.

Conclusions: This study provides a comprehensive analysis of the performance of different AE variants in retrieving similar patients and evaluates the impact of various hyperparameter configurations on model performance. The findings lay the groundwork for future development of AE-based patient similarity estimation and personalized medicine.

Abstract Image

查看原文本刊更多论文

基于自编码器的表示学习在电子病历中检索相似病人：比较研究。

背景：通过分析类似患者的电子健康记录快照，医生可以主动预测疾病发作，定制治疗计划，并预测患者特定的轨迹。然而，由于电子健康记录数据的高维性、混合特征类型、噪声、偏差和稀疏性，其建模本身就具有挑战性。使用自编码器（ae）的患者表征学习为解决这些挑战提供了有希望的机会。一个关键的问题仍然存在：不同的AE设计和距离测量如何影响检索到的相似患者队列的质量？目的：本研究旨在评价5种常见的AE变体——香草型自编码器、去噪型自编码器、收缩型自编码器、稀疏型自编码器和鲁棒型自编码器在检索相似患者中的性能。此外，还研究了不同距离度量和超参数配置对模型性能的影响。方法：我们在2个真实世界的数据集——堪萨斯大学医学中心（n=13,752）和威斯康星医学院（n=9568）上测试了5种AE变体，涉及168种不同的超参数配置。为了根据ae产生的潜在表征检索相似的患者，我们使用欧几里得和马氏距离应用k-近邻（k-NN）。评估两个预测指标：急性肾损伤发病和出院后1年死亡率。结果：我们的研究结果表明：(1)当与欧几里得距离配对时，去噪自编码器的性能优于其他AE变体(p)。结论：本研究全面分析了不同AE变体在检索相似患者时的性能，并评估了各种超参数配置对模型性能的影响。研究结果为未来基于ae的患者相似性估计和个性化医疗的发展奠定了基础。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

JMIR Medical Informatics Medicine-Health Informatics

CiteScore

7.90

自引率

3.10%

发文量

173

审稿时长

12 weeks

期刊介绍： JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (emphasizing more on applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.