Optimizing unsupervised feature engineering and classification pipelines for differentiated thyroid cancer recurrence prediction.

IF 3.3 | Zone 3, Medicine | JCR Q2, Medical Informatics
Emmanuel Onah, Uche Jude Eze, Abdullahi Salahudeen Abdulraheem, Ugochukwu Gabriel Ezigbo, Kosisochi Chinwendu Amorha, Fidele Ntie-Kang
Journal: BMC Medical Informatics and Decision Making, 25(1), 182. Journal Article. Published 2025-05-13.
DOI: 10.1186/s12911-025-03018-3 (https://doi.org/10.1186/s12911-025-03018-3)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12070754/pdf/

Abstract

Background: Differentiated thyroid cancer (DTC) is a common endocrine malignancy with rising incidence and frequent recurrence, despite a generally favorable prognosis. Accurate recurrence prediction is critical for guiding post-treatment strategies. This study aimed to enhance predictive performance by refining feature engineering and evaluating a diverse set of machine learning models using the UCI DTC dataset.

Methods: Unsupervised data engineering, specifically dimensionality reduction and clustering, was used to improve feature quality. Principal Component Analysis (PCA) and Truncated Singular Value Decomposition (t-SVD) were selected based on superior clustering metrics: Adjusted Rand Index (ARI > 0.55) and V-measure (> 0.45). These were integrated into classification pipelines using Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Feedforward Neural Network (FNN), and Gradient Boosting (GB). Model performance was evaluated through bootstrapping on an independent test set, stratified 10-fold cross-validation (CV), and subgroup analyses. Metrics included balanced accuracy, F1 score, AUC, sensitivity, specificity, and precision, each reported with 95% confidence intervals (CIs). SHAP analysis supported model interpretability.
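
The two-stage workflow described in the Methods can be sketched with scikit-learn. This is an illustrative sketch only, not the authors' code. It makes several assumptions that do not come from the abstract: synthetic data from `make_classification` stands in for the UCI DTC dataset, K-means with two clusters is used as the clustering algorithm, and `n_components=5` is an arbitrary choice. The first step scores each reduced feature space with ARI and V-measure; the second wraps the same reducers in a Logistic Regression pipeline evaluated by stratified 10-fold CV.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score, v_measure_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in data; size and class balance are arbitrary,
# not taken from the UCI DTC dataset.
X, y = make_classification(n_samples=400, n_features=16, n_informative=6,
                           weights=[0.7, 0.3], random_state=0)


def make_reducers(n_components=5):
    """Return fresh, unfitted reducers (n_components=5 is a hypothetical choice)."""
    return {
        "pca": PCA(n_components=n_components, random_state=0),
        "tsvd": TruncatedSVD(n_components=n_components, random_state=0),
    }


# Step 1: reduce dimensionality, cluster without labels, then compare the
# clusters to the true labels with ARI and V-measure.
# ASSUMPTION: K-means with 2 clusters; the abstract does not name the algorithm.
X_scaled = StandardScaler().fit_transform(X)
cluster_scores = {}
for name, reducer in make_reducers().items():
    Z = reducer.fit_transform(X_scaled)
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
    cluster_scores[name] = {
        "ari": adjusted_rand_score(y, clusters),
        "v_measure": v_measure_score(y, clusters),
    }

# Step 2: put each reducer in a classification pipeline and evaluate it with
# stratified 10-fold cross-validation. Scaling and reduction are refit inside
# every fold, so no information leaks from the held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = {
    "balanced_accuracy": "balanced_accuracy",
    "auc": "roc_auc",
    "sensitivity": "recall",
}
cv_scores = {}
for name, reducer in make_reducers().items():
    pipe = make_pipeline(StandardScaler(), reducer,
                         LogisticRegression(max_iter=1000))
    result = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
    cv_scores[name] = {m: float(result[f"test_{m}"].mean()) for m in scoring}

for name in cv_scores:
    print(name, cluster_scores[name], cv_scores[name])
```

Keeping the reducer inside the pipeline, rather than transforming the full dataset once up front, is the design choice that matters here: it ensures the CV estimates reflect what the model would see on unseen patients.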

Results: The PCA-based LR pipeline achieved the best test set performance: balanced accuracy of 0.95 (95% CI: 0.90-0.99), AUC of 0.99 (95% CI: 0.97-1.00), and sensitivity of 0.94 (95% CI: 0.84-1.00). In stratified CV, it maintained strong results (balanced accuracy: 0.86; AUC: 0.97; sensitivity: 0.80), with consistent performance across clinically relevant subgroups. The t-SVD-based LR pipeline showed comparable performance on both test and CV sets. SVM and FNN pipelines also performed robustly (test AUCs > 0.99; CV AUCs > 0.96). RF and KNN had high specificity but slightly lower sensitivity (test: ~0.87; CV: 0.77-0.80). GB pipelines showed the lowest overall performance (test balanced accuracy: 0.86-0.88; CV: 0.85-0.88).
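
Confidence intervals like those reported for the test set can be obtained by resampling the test predictions. The sketch below is a minimal illustration under stated assumptions: it uses a percentile bootstrap with 2000 resamples (the abstract does not say which bootstrap variant or how many resamples were used), and the labels and predictions are made-up toy values, not results from the study. The helper names `balanced_accuracy` and `bootstrap_ci` are hypothetical.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = recurrence)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (ASSUMPTION: variant not specified in the abstract)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        # Resample test cases with replacement, keeping label/prediction pairs.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # skip resamples missing a class; the metric is undefined
        yp = [y_pred[i] for i in idx]
        stats.append(metric(yt, yp))
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lower, upper


# Toy data (hypothetical): 30 recurrences, 70 non-recurrences,
# with 2 errors in each class.
y_true = [1] * 30 + [0] * 70
y_pred = [1] * 28 + [0] * 2 + [0] * 68 + [1] * 2

point = balanced_accuracy(y_true, y_pred)  # 0.5 * (28/30 + 68/70), about 0.952
lower, upper = bootstrap_ci(y_true, y_pred, balanced_accuracy)
print(f"balanced accuracy = {point:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```

The same `bootstrap_ci` helper works for sensitivity, specificity, or precision by passing a different metric function.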

Conclusions: Dimensionality reduction via PCA and t-SVD significantly improved model performance, particularly for LR, SVM, FNN, RF and KNN classifiers. The PCA-based LR pipeline showed the best generalizability, supporting its potential integration into clinical decision-support tools for personalized DTC management.

Clinical trial number: Not applicable.

Source journal: BMC Medical Informatics and Decision Making
CiteScore: 7.20 | Self-citation rate: 5.70% | Articles published: 297 | Review time: 1 month
About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.