Unsupervised Feature Selection to Identify Important ICD-10 and ATC Codes for Machine Learning on a Cohort of Patients With Coronary Heart Disease: Retrospective Study
{"title":"Unsupervised Feature Selection to Identify Important ICD-10 and ATC Codes for Machine Learning on a Cohort of Patients With Coronary Heart Disease: Retrospective Study","authors":"Peyman Ghasemi, Joon Lee","doi":"10.2196/52896","DOIUrl":null,"url":null,"abstract":"Background: The application of machine learning in healthcare often necessitates the use of hierarchical codes such as the International Classification of Diseases (ICD) and Anatomical Therapeutic Chemical (ATC) systems. These codes classify diseases and medications respectively, thereby forming extensive data dimensions. Unsupervised feature selection tackles the \"curse of dimensionality\" and helps to improve the accuracy and performance of supervised learning models by reducing the number of irrelevant or redundant features and avoiding overfitting. Techniques for unsupervised feature selection, such as filter, wrapper, and embedded methods, are implemented to select the most important features with the most intrinsic information. However, they face challenges due to the sheer volume of ICD/ATC codes and the hierarchical structures of these systems. Objective: The objective of this study was to compare several unsupervised feature selection methods for ICD and ATC code databases of coronary artery disease patients in different aspects of performance and complexity and select the best set of features representing these patients. Methods: We compared several unsupervised feature selection methods for two ICD and one ATC code databases of 51,506 coronary artery disease patients in Alberta, Canada. Specifically, we employed Laplacian Score, Unsupervised Feature Selection for Multi-Cluster Data, Autoencoder Inspired Unsupervised Feature Selection, Principal Feature Analysis, and Concrete Autoencoders with and without ICD/ATC tree weight adjustment to select the 100 best features from over 9,000 ICD and 2,000 ATC codes. 
We assessed the selected features based on their ability to reconstruct the initial feature space and predict 90-day mortality following discharge. We also compared the complexity of selected features by mean code level in ICD/ATC tree and the interpretability of the features in the mortality prediction task using Shapley analysis. Results: In feature space reconstruction and mortality prediction, the Concrete Autoencoder-based methods outperformed other techniques. A weight-adjusted Concrete Autoencoder variant, particularly, demonstrated improved reconstruction accuracy and significant predictive performance enhancement, confirmed by DeLong's and McNemar's tests (P<.05). Concrete Autoencoders preferred more general codes and they consistently reconstructed all features accurately. Additionally, features selected by weight-adjusted Concrete Autoencoders yielded higher Shapley values in mortality prediction compared to most alternatives. Conclusions: This study scrutinized five feature selection methods in ICD/ATC code datasets in an unsupervised context. Our findings underscore the superiority of the Concrete Autoencoder method in selecting salient features that represent the entire dataset, offering a potential asset for subsequent machine learning research. 
We also present a novel weight adjustment approach for the Concrete Autoencoders specifically tailored for ICD/ATC code datasets to enhance the generalizability and interpretability of the selected features.","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"61 1","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/52896","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Citations: 0
Abstract
Background: The application of machine learning in healthcare often necessitates the use of hierarchical codes such as the International Classification of Diseases (ICD) and Anatomical Therapeutic Chemical (ATC) systems. These codes classify diseases and medications, respectively, and therefore produce very high-dimensional data. Unsupervised feature selection tackles the "curse of dimensionality" and helps to improve the accuracy and performance of supervised learning models by reducing the number of irrelevant or redundant features and avoiding overfitting. Techniques for unsupervised feature selection, such as filter, wrapper, and embedded methods, are used to select the features that carry the most intrinsic information. However, they face challenges due to the sheer volume of ICD/ATC codes and the hierarchical structures of these systems.

Objective: The objective of this study was to compare several unsupervised feature selection methods for ICD and ATC code databases of patients with coronary artery disease with respect to performance and complexity, and to select the best set of features representing these patients.

Methods: We compared several unsupervised feature selection methods on two ICD code databases and one ATC code database of 51,506 patients with coronary artery disease in Alberta, Canada. Specifically, we employed Laplacian Score, Unsupervised Feature Selection for Multi-Cluster Data, Autoencoder Inspired Unsupervised Feature Selection, Principal Feature Analysis, and Concrete Autoencoders with and without ICD/ATC tree weight adjustment to select the 100 best features from over 9,000 ICD and 2,000 ATC codes. We assessed the selected features based on their ability to reconstruct the initial feature space and predict 90-day mortality following discharge. We also compared the complexity of the selected features by mean code level in the ICD/ATC tree and the interpretability of the features in the mortality prediction task using Shapley analysis.
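As an illustration of the filter-style baseline named in the Methods, the following is a minimal sketch of the Laplacian Score on a toy binary patient-by-code matrix. It is not the authors' implementation. The function name `laplacian_scores`, the toy matrix `X_demo`, the bandwidth `t`, and the use of a fully connected heat-kernel graph (the usual formulation builds a k-nearest-neighbor graph instead) are all assumptions made here for brevity. Lower scores indicate features that better preserve the local structure of the data.

```python
import math


def laplacian_scores(X, t=1.0):
    """Return one Laplacian Score per column of X (lower = more important).

    Assumption: a fully connected graph with heat-kernel weights is used
    instead of a k-nearest-neighbor graph, to keep the sketch short.
    """
    n, d = len(X), len(X[0])

    # Affinity matrix S: heat kernel on squared Euclidean distance, no self-loops.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            S[i][j] = S[j][i] = math.exp(-dist2 / t)

    deg = [sum(row) for row in S]  # diagonal of the degree matrix D
    total_deg = sum(deg)

    scores = []
    for r in range(d):
        f = [X[i][r] for i in range(n)]
        # Remove the degree-weighted mean so constant features are not favored.
        mean = sum(deg[i] * f[i] for i in range(n)) / total_deg
        fc = [v - mean for v in f]
        # Numerator f^T L f, written as a sum over unordered pairs.
        num = sum(
            S[i][j] * (fc[i] - fc[j]) ** 2
            for i in range(n)
            for j in range(i + 1, n)
        )
        # Denominator f^T D f (degree-weighted variance).
        den = sum(deg[i] * fc[i] ** 2 for i in range(n))
        scores.append(num / den if den > 0 else float("inf"))
    return scores


# Hypothetical toy data: 6 patients x 3 codes (1 = code present).
# Columns 0 and 2 follow the two patient groups; column 1 does not.
X_demo = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

if __name__ == "__main__":
    for idx, score in enumerate(laplacian_scores(X_demo)):
        print(f"feature {idx}: {score:.3f}")
```

In this toy matrix, the first and third columns receive lower (better) scores than the second, because they vary little between patients that the graph treats as close neighbors. On the scale described in the abstract (over 9,000 ICD codes and 51,506 patients), a sparse neighbor graph and a vectorized implementation would be needed.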
Results: In feature space reconstruction and mortality prediction, the Concrete Autoencoder-based methods outperformed other techniques. A weight-adjusted Concrete Autoencoder variant, particularly, demonstrated improved reconstruction accuracy and significant predictive performance enhancement, confirmed by DeLong's and McNemar's tests (P<.05). Concrete Autoencoders preferred more general codes and they consistently reconstructed all features accurately. Additionally, features selected by weight-adjusted Concrete Autoencoders yielded higher Shapley values in mortality prediction compared to most alternatives.

Conclusions: This study scrutinized five feature selection methods in ICD/ATC code datasets in an unsupervised context. Our findings underscore the superiority of the Concrete Autoencoder method in selecting salient features that represent the entire dataset, offering a potential asset for subsequent machine learning research. We also present a novel weight adjustment approach for the Concrete Autoencoders specifically tailored for ICD/ATC code datasets to enhance the generalizability and interpretability of the selected features.
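The Results cite McNemar's test for comparing the predictions of two mortality models on the same patients. The abstract does not say which variant was used, so the sketch below assumes the exact (binomial) form. The function names, the toy labels, and the toy predictions are all hypothetical and only illustrate how the test is computed from the discordant pairs.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Count patients for whom exactly one of the two models is correct."""
    a_only = b_only = 0
    for y, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == y), (b == y)
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1
    return a_only, b_only


def mcnemar_exact(a_only, b_only):
    """Two-sided exact McNemar p-value.

    Under the null hypothesis, each discordant pair is equally likely to
    favor either model, so the smaller count follows Binomial(n, 0.5).
    """
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical 90-day mortality labels and predictions from two models.
y_demo = [1, 0, 1, 1, 0]
model_a = [1, 0, 0, 1, 0]
model_b = [0, 0, 1, 0, 1]

if __name__ == "__main__":
    a_only, b_only = discordant_counts(y_demo, model_a, model_b)
    print("discordant pairs:", a_only, b_only)
    print("exact McNemar p-value:", mcnemar_exact(a_only, b_only))
```

Only the patients on whom the two models disagree in correctness enter the test, which is why it suits paired comparisons on a single held-out cohort. DeLong's test, also cited in the Results, compares areas under the ROC curve instead and is not shown here.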
About the journal:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, and industry and health informatics professionals.
JMIR Medical Informatics is published by JMIR Publications, which also publishes the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175). Its scope differs slightly from that of JMIR: it places more emphasis on applications for clinicians and health professionals than on consumers and citizens (the focus of JMIR), publishes even faster, and accepts papers that are more technical or more formative than those published in JMIR.