Unsupervised Feature Selection to Identify Important ICD-10 and ATC Codes for Machine Learning on a Cohort of Patients With Coronary Heart Disease: Retrospective Study

IF 3.1 · Zone 3 (Medicine) · Q2 Medical Informatics

JMIR Medical Informatics · Pub Date: 2024-07-26 · DOI: 10.2196/52896

Peyman Ghasemi, Joon Lee


Citations: 0

Abstract

Background: The application of machine learning in health care often requires hierarchical codes such as those of the International Classification of Diseases (ICD) and Anatomical Therapeutic Chemical (ATC) systems. These codes classify diseases and medications, respectively, and produce very high-dimensional data. Unsupervised feature selection tackles the "curse of dimensionality" and helps improve the accuracy and performance of supervised learning models by reducing the number of irrelevant or redundant features and avoiding overfitting. Techniques for unsupervised feature selection, such as filter, wrapper, and embedded methods, select the features that carry the most intrinsic information. However, they face challenges due to the sheer volume of ICD/ATC codes and the hierarchical structures of these systems.

Objective: This study aimed to compare several unsupervised feature selection methods on ICD and ATC code databases of patients with coronary artery disease, in terms of both performance and complexity, and to select the best set of features representing these patients.

Methods: We compared several unsupervised feature selection methods on two ICD code databases and one ATC code database covering 51,506 patients with coronary artery disease in Alberta, Canada. Specifically, we employed Laplacian Score, Unsupervised Feature Selection for Multi-Cluster Data, Autoencoder Inspired Unsupervised Feature Selection, Principal Feature Analysis, and Concrete Autoencoders with and without ICD/ATC tree weight adjustment to select the 100 best features from over 9,000 ICD and 2,000 ATC codes. We assessed the selected features based on their ability to reconstruct the initial feature space and to predict 90-day mortality following discharge. We also compared the complexity of the selected features, measured as the mean code level in the ICD/ATC tree, and the interpretability of the features in the mortality prediction task using Shapley analysis.
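To make the simplest of the methods named above concrete, here is a minimal NumPy sketch of the Laplacian Score. It is an illustration under stated assumptions, not the study's implementation: it assumes a dense patient-by-code matrix small enough to hold an n × n similarity matrix, a k-nearest-neighbor graph with heat-kernel weights, and the usual convention that lower scores mark features that better preserve local structure. The function names and the `n_neighbors` and `t` parameters are hypothetical choices made for this example.

```python
import numpy as np


def laplacian_scores(X, n_neighbors=5, t=1.0):
    """Return one Laplacian Score per column of X (lower = more important)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    k = min(n_neighbors, n - 1)  # a sample cannot have more than n - 1 neighbors

    # Pairwise squared Euclidean distances between samples (rows).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # k-nearest-neighbor graph; the diagonal is masked so a sample never
    # counts as its own neighbor, even when duplicate rows exist.
    masked = d2.copy()
    np.fill_diagonal(masked, np.inf)
    nn = np.argsort(masked, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nn.ravel()

    # Heat-kernel edge weights, symmetrized so S is a valid affinity matrix.
    S = np.zeros((n, n))
    S[rows, cols] = np.exp(-d2[rows, cols] / t)
    S = np.maximum(S, S.T)

    d = S.sum(axis=1)       # node degrees (diagonal of D)
    L = np.diag(d) - S      # graph Laplacian

    # Remove the degree-weighted mean of each feature.
    Xc = X - (d @ X) / d.sum()

    num = np.sum(Xc * (L @ Xc), axis=0)           # f^T L f per feature
    den = np.sum(Xc * (d[:, None] * Xc), axis=0)  # f^T D f per feature

    # Constant features have zero weighted variance; give them the worst score.
    scores = np.full(X.shape[1], np.inf)
    np.divide(num, den, out=scores, where=den > 1e-12)
    return scores


def select_features(X, k, **kwargs):
    """Indices of the k features with the lowest Laplacian Score."""
    return np.argsort(laplacian_scores(X, **kwargs))[:k]


# Toy data: column 0 separates two groups, column 1 varies within each group.
X_toy = np.array([[0, 0], [0, 1], [0, 0], [0, 1],
                  [10, 0], [10, 1], [10, 0], [10, 1]])
toy_scores = laplacian_scores(X_toy, n_neighbors=3)
best = select_features(X_toy, 1, n_neighbors=3)
```

On the toy matrix, the column that separates the two groups receives the lower score and is selected first. At the scale described in the abstract (tens of thousands of patients), a sparse neighbor graph would be needed in place of the dense matrices used here.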
Results: In feature space reconstruction and mortality prediction, the Concrete Autoencoder-based methods outperformed the other techniques. In particular, a weight-adjusted Concrete Autoencoder variant demonstrated improved reconstruction accuracy and a significant improvement in predictive performance, confirmed by DeLong's and McNemar's tests (P<.05). Concrete Autoencoders preferred more general codes, and they consistently reconstructed all features accurately. Additionally, features selected by weight-adjusted Concrete Autoencoders yielded higher Shapley values in mortality prediction than those selected by most alternatives.

Conclusions: This study examined five feature selection methods on ICD/ATC code datasets in an unsupervised context. Our findings underscore the superiority of the Concrete Autoencoder method in selecting salient features that represent the entire dataset, offering a potential asset for subsequent machine learning research. We also present a novel weight adjustment approach for Concrete Autoencoders, specifically tailored to ICD/ATC code datasets, to enhance the generalizability and interpretability of the selected features.
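The abstract reports that predictive differences were confirmed with McNemar's test but does not say which variant was used. As a hedged illustration only, the sketch below implements the two-sided exact (binomial) form for two classifiers evaluated on the same patients. The function name and the toy labels are hypothetical, and DeLong's test for comparing AUCs is not shown.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts cases only model A got right
    and c counts cases only model B got right.
    """
    b = c = 0
    for y, a, m in zip(y_true, pred_a, pred_b):
        a_ok, m_ok = (a == y), (m == y)
        if a_ok and not m_ok:
            b += 1
        elif m_ok and not a_ok:
            c += 1

    n = b + c
    if n == 0:
        return b, c, 1.0  # the two models never disagree

    # Under H0 the discordant pairs split 50/50: b ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy example: model A is right on all 10 cases, model B on none of them.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 10
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

Only the discordant pairs (cases where exactly one model is correct) enter the test, which is why it suits paired comparisons on a shared test set. With many discordant pairs, the chi-square approximation is a common alternative to the exact form.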




Source journal

JMIR Medical Informatics (Medicine: Health Informatics)

CiteScore: 7.90
Self-citation rate: 3.10%
Articles published: 173
Review time: 12 weeks

About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, and industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (placing more emphasis on applications for clinicians and health professionals than on consumers/citizens, who are the focus of JMIR), publishes even faster, and also accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.