{"title":"慢性肾脏疾病(CKD)的早期检测:一种新的混合特征选择方法和不同ML技术的鲁棒数据准备管道","authors":"Md. Taufiqul Haque Khan Tusar, Md. Touhidul Islam, Foyjul Islam Raju","doi":"10.48550/arXiv.2203.01394","DOIUrl":null,"url":null,"abstract":"Chronic Kidney Disease (CKD) has infected almost 800 million people around the world. Around 1.7 million people die each year because of it. Detecting CKD in the initial stage is essential for saving millions of lives. Many researchers have applied distinct Machine Learning (ML) methods to detect CKD at an early stage, but detailed studies are still missing. We present a structured and thorough method for dealing with the complexities of medical data with optimal performance. Besides, this study will assist researchers in producing clear ideas on the medical data preparation pipeline. In this paper, we applied KNN Imputation to impute missing values, Local Outlier Factor to remove outliers, SMOTE to handle data imbalance, K-stratified K-fold Cross-validation to validate the ML models, and a novel hybrid feature selection method to remove redundant features. Applied algorithms in this study are Support Vector Machine, Gaussian Naive Bayes, Decision Tree, Random Forest, Logistic Regression, K-Nearest Neighbour, Gradient Boosting, Adaptive Boosting, and Extreme Gradient Boosting. Finally, the Random Forest can detect CKD with 100% accuracy without any data leakage.","PeriodicalId":122550,"journal":{"name":"2022 5th International Conference on Computing and Informatics (ICCI)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Detecting Chronic Kidney Disease(CKD) at the Initial Stage: A Novel Hybrid Feature-selection Method and Robust Data Preparation Pipeline for Different ML Techniques\",\"authors\":\"Md. Taufiqul Haque Khan Tusar, Md. Touhidul Islam, Foyjul Islam Raju\",\"doi\":\"10.48550/arXiv.2203.01394\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Chronic Kidney Disease (CKD) has infected almost 800 million people around the world. Around 1.7 million people die each year because of it. Detecting CKD in the initial stage is essential for saving millions of lives. Many researchers have applied distinct Machine Learning (ML) methods to detect CKD at an early stage, but detailed studies are still missing. We present a structured and thorough method for dealing with the complexities of medical data with optimal performance. Besides, this study will assist researchers in producing clear ideas on the medical data preparation pipeline. In this paper, we applied KNN Imputation to impute missing values, Local Outlier Factor to remove outliers, SMOTE to handle data imbalance, K-stratified K-fold Cross-validation to validate the ML models, and a novel hybrid feature selection method to remove redundant features. Applied algorithms in this study are Support Vector Machine, Gaussian Naive Bayes, Decision Tree, Random Forest, Logistic Regression, K-Nearest Neighbour, Gradient Boosting, Adaptive Boosting, and Extreme Gradient Boosting. Finally, the Random Forest can detect CKD with 100% accuracy without any data leakage.\",\"PeriodicalId\":122550,\"journal\":{\"name\":\"2022 5th International Conference on Computing and Informatics (ICCI)\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 5th International Conference on Computing and Informatics (ICCI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2203.01394\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 5th International Conference on Computing and Informatics (ICCI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2203.01394","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4
摘要
慢性肾脏疾病(CKD)已经感染了全世界近8亿人。每年约有170万人因此死亡。在早期发现CKD对于挽救数百万人的生命至关重要。许多研究人员已经应用不同的机器学习(ML)方法在早期检测CKD,但详细的研究仍然缺失。我们提出了一种结构化和彻底的方法来处理具有最佳性能的医疗数据的复杂性。此外,本研究将有助于研究人员对医疗数据制备管道产生明确的想法。本文采用KNN Imputation法对缺失值进行Imputation, Local Outlier Factor法对异常值进行去除,SMOTE法对数据不平衡进行处理,K-stratified K-fold Cross-validation法对ML模型进行验证,并采用一种新的混合特征选择方法去除冗余特征。本研究中应用的算法有支持向量机、高斯朴素贝叶斯、决策树、随机森林、逻辑回归、k近邻、梯度增强、自适应增强和极端梯度增强。最后,随机森林可以100%的准确率检测CKD,没有任何数据泄漏。
Detecting Chronic Kidney Disease(CKD) at the Initial Stage: A Novel Hybrid Feature-selection Method and Robust Data Preparation Pipeline for Different ML Techniques
Chronic Kidney Disease (CKD) has infected almost 800 million people around the world. Around 1.7 million people die each year because of it. Detecting CKD in the initial stage is essential for saving millions of lives. Many researchers have applied distinct Machine Learning (ML) methods to detect CKD at an early stage, but detailed studies are still missing. We present a structured and thorough method for dealing with the complexities of medical data with optimal performance. Besides, this study will assist researchers in producing clear ideas on the medical data preparation pipeline. In this paper, we applied KNN Imputation to impute missing values, Local Outlier Factor to remove outliers, SMOTE to handle data imbalance, K-stratified K-fold Cross-validation to validate the ML models, and a novel hybrid feature selection method to remove redundant features. Applied algorithms in this study are Support Vector Machine, Gaussian Naive Bayes, Decision Tree, Random Forest, Logistic Regression, K-Nearest Neighbour, Gradient Boosting, Adaptive Boosting, and Extreme Gradient Boosting. Finally, the Random Forest can detect CKD with 100% accuracy without any data leakage.