Effect of dimension reduction with PCA and machine learning algorithms on diabetes diagnosis performance

Yavuz Bahadir Koca, Elif Aktepe
Turkish Journal of Engineering, published 2024-07-05 (journal article)
DOI: 10.31127/tuje.1413087 (https://doi.org/10.31127/tuje.1413087)

Abstract

Diabetes is a chronic metabolic disorder that causes persistently high blood sugar and poses a significant global health challenge. Early diagnosis is vital to mitigating its effects. This study investigates diabetes diagnosis and risk prediction using a comprehensive diabetes dataset created in 2023. The dataset contains clinical and anthropometric data for 100,000 individuals and consists of 8 input features and 1 output feature. It contained no missing values, and the necessary transformations were carried out during preprocessing. Data simplification was applied to remove unnecessary information, and methods such as Principal Component Analysis (PCA) were applied to reduce the number of variables. These steps made the dataset more manageable and improved performance. Nine machine learning algorithms were then applied to the dataset, and each was evaluated on its performance in diagnosing diabetes, with the primary objective of identifying the best-performing algorithm. The results show that machine learning models can successfully predict both the presence of diabetes and the risk of developing it in healthy individuals. In particular, the random forest model gave the best results across all performance metrics. These findings can inform future research on diabetes diagnosis and risk prediction. They also indicate that dimensionality reduction techniques are valuable in data analysis and have the potential to facilitate diabetes diagnosis, thereby improving patients' quality of life.
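
The workflow the abstract describes (scaling, PCA, then a classifier such as random forest) can be illustrated roughly as follows. This is a minimal sketch using scikit-learn, not the authors' code: the data are synthetic (generated with `make_classification` to mimic 8 input features and 1 binary output), and the sample size, the 95% explained-variance threshold, the train/test split, and the forest size are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption): 8 inputs, 1 binary label.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5, n_redundant=2,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Standardize the features, keep the principal components that explain
# 95% of the variance (assumed threshold), then fit a random forest.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)

n_components = model.named_steps["pca"].n_components_
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"components kept: {n_components} of {X.shape[1]}")
print(f"test accuracy: {accuracy:.3f}")
```

Wrapping the steps in a single pipeline means the scaler and PCA are fitted on the training split only, which avoids leaking test information into the reduced features. Swapping `RandomForestClassifier` for another estimator is one way to compare several algorithms under the same preprocessing.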