Abdussamad, Hanita Daud, Rajalingam Sokkalingam, Muhammad Zubair, Iliyas Karim Khan, Zafar Mahmood
Engineering Reports: Open Access, vol. 7, no. 9. Published 2025-09-22. DOI: 10.1002/eng2.70358
https://onlinelibrary.wiley.com/doi/10.1002/eng2.70358
Latent Feature-Based Type 2 Diabetes Prediction Using a Hybrid Stacked Sparse Autoencoder and Machine Learning Models
Early and precise prediction of Type 2 diabetes is vital for effective intervention. However, extracting meaningful insights from high-dimensional datasets with sparse values remains challenging: sparsity and redundant features often hinder the ability of traditional machine learning algorithms to identify informative patterns. While conventional Stacked Sparse Autoencoders (SSAE) can capture key features in dense data, they typically struggle with high-dimensional sparse data, which reduces classification accuracy. To address this limitation, this study proposes a Hybrid Stacked Sparse Autoencoder (HSSAE) algorithm designed for robust feature extraction and classification in sparse data environments. The architecture incorporates L1 and L2 regularization within a binary cross-entropy loss and employs dropout and batch normalization to improve generalization and training stability. The HSSAE algorithm was evaluated both with a sigmoid classification layer and with several traditional machine learning classifiers. With a sigmoid layer, the model achieved 89% accuracy and an F1 score of 0.89. It also outperformed baseline models when integrated with traditional classifiers; notably, HSSAE + K-Nearest Neighbor (KNN) achieved an F1 score of 0.91, a recall of 0.98, 90% accuracy, and the lowest Hamming loss, 0.10. Comparative evaluations included baseline classifiers such as Logistic Regression (LR), KNN, Naïve Bayes (NB), AdaBoost, and XGBoost, applied directly to the preprocessed dataset. An ablation study tested the same classifiers on features extracted by the conventional SSAE. In both comparisons, the HSSAE algorithm showed superior performance across all metrics. These findings demonstrate the HSSAE algorithm's effectiveness in extracting discriminative features from sparse, high-dimensional data and indicate its potential for clinical decision support systems that require high accuracy and reliability.
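The abstract names two ingredients that are easy to make concrete: a binary cross-entropy loss with added L1 and L2 penalties, and the reported evaluation metrics (accuracy, recall, F1, Hamming loss). The following is a minimal illustrative sketch in plain Python, not the authors' implementation. The abstract does not say whether the penalties apply to weights or to activations, nor what coefficients were used, so the function names, the choice to penalize a flat list of weights, and the `l1`/`l2` defaults are all assumptions made here for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never receives 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def l1_l2_penalty(weights, l1, l2):
    """Combined penalty: l1 * sum(|w|) + l2 * sum(w^2) over a flat list of weights."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


def regularized_bce(y_true, y_prob, weights, l1=1e-4, l2=1e-4):
    # ASSUMPTION: penalties are applied to model weights with hypothetical
    # coefficients; the abstract does not specify either detail.
    return binary_cross_entropy(y_true, y_prob) + l1_l2_penalty(weights, l1, l2)


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and Hamming loss from hard 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # For single-label binary classification this is the misclassified fraction.
        "hamming_loss": (fp + fn) / n,
    }


if __name__ == "__main__":
    # Toy labels and predictions, made up purely to exercise the functions.
    labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    preds = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
    print(binary_metrics(labels, preds))
    print(regularized_bce([1, 0], [0.9, 0.2], weights=[0.5, -1.5, 0.0]))
```

For a single-label binary task, Hamming loss is the fraction of misclassified samples, so it equals one minus accuracy. This is consistent with the figures reported for HSSAE + KNN (90% accuracy, Hamming loss 0.10).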