Optimizing unsupervised feature engineering and classification pipelines for differentiated thyroid cancer recurrence prediction.

IF 3.3 | Zone 3 (Medicine) | Q2 Medical Informatics

BMC Medical Informatics and Decision Making | Pub Date: 2025-05-13 | DOI: 10.1186/s12911-025-03018-3

Emmanuel Onah, Uche Jude Eze, Abdullahi Salahudeen Abdulraheem, Ugochukwu Gabriel Ezigbo, Kosisochi Chinwendu Amorha, Fidele Ntie-Kang

Volume 25 (1), 182. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12070754/pdf/

Abstract

Background: Differentiated thyroid cancer (DTC) is a common endocrine malignancy with rising incidence and frequent recurrence, despite a generally favorable prognosis. Accurate recurrence prediction is critical for guiding post-treatment strategies. This study aimed to enhance predictive performance by refining feature engineering and evaluating a diverse ensemble of machine learning models using the UCI DTC dataset.

Methods: Unsupervised data engineering (specifically dimensionality reduction and clustering) was used to improve feature quality. Principal Component Analysis (PCA) and Truncated Singular Value Decomposition (t-SVD) were selected based on superior clustering metrics: adjusted Rand Index (ARI > 0.55) and V-measure (> 0.45). These were integrated into classification pipelines using Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Feedforward Neural Network (FNN), and Gradient Boosting (GB). Model performance was evaluated through bootstrapping on an independent test set, stratified 10-fold cross-validation (CV), and subgroup analyses. Metrics included balanced accuracy, F1 score, AUC, sensitivity, specificity, and precision, each reported with 95% confidence intervals (CIs). SHAP analysis supported model interpretability.
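
The workflow described in the Methods can be sketched roughly as follows with scikit-learn. This is an illustrative outline, not the authors' code: the synthetic data from `make_classification`, the use of `KMeans` for clustering, the `StandardScaler` step, and the choice of 5 components are all assumptions made for the example, since the abstract does not specify them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score, v_measure_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for encoded clinical features (NOT the UCI DTC dataset).
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=6,
    weights=[0.7, 0.3], random_state=0,
)


def clustering_scores(reducer):
    """Reduce X, cluster into two groups, and compare clusters to the labels."""
    Z = reducer.fit_transform(StandardScaler().fit_transform(X))
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
    return adjusted_rand_score(y, clusters), v_measure_score(y, clusters)


# Assumed settings: 5 components for both reducers.
reducers = {
    "PCA": PCA(n_components=5, random_state=0),
    "t-SVD": TruncatedSVD(n_components=5, random_state=0),
}
scores = {name: clustering_scores(r) for name, r in reducers.items()}

# Reducer and classifier live in one pipeline, so every CV fold refits the
# reducer on its own training split and nothing leaks into validation.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5, random_state=0),
    LogisticRegression(max_iter=1000),
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="balanced_accuracy")

for name, (ari, vm) in scores.items():
    print(f"{name}: ARI={ari:.2f}, V-measure={vm:.2f}")
print(f"Stratified 10-fold balanced accuracy: {cv_scores.mean():.2f}")
```

Swapping `LogisticRegression` for another estimator (for example `SVC` or `RandomForestClassifier`) would give the other pipeline variants named above; the numbers printed on synthetic data say nothing about the study's reported results.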

Results: The PCA-based LR pipeline achieved the best test set performance: balanced accuracy of 0.95 (95% CI: 0.90-0.99), AUC of 0.99 (95% CI: 0.97-1.00), and sensitivity of 0.94 (95% CI: 0.84-1.00). In stratified CV, it maintained strong results (balanced accuracy: 0.86; AUC: 0.97; sensitivity: 0.80), with consistent performance across clinically relevant subgroups. The t-SVD-based LR pipeline showed comparable performance on both test and CV sets. SVM and FNN pipelines also performed robustly (test AUCs > 0.99; CV AUCs > 0.96). RF and KNN had high specificity but slightly lower sensitivity (test: ~0.87; CV: 0.77-0.80). GB pipelines showed the lowest overall performance (test balanced accuracy: 0.86-0.88; CV: 0.85-0.88).
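
The 95% confidence intervals quoted for the test set come from bootstrapping. A minimal sketch of one common variant, the percentile bootstrap, is shown here; the abstract does not say which bootstrap variant or how many resamples were used, so `n_boot=2000`, the percentile method, and the toy labels below are assumptions for illustration only.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == 0] == 0)
    return (sensitivity + specificity) / 2


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample test cases with replacement."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        # Skip resamples that lost a class; the metric is undefined there.
        if len(np.unique(y_true[idx])) < 2:
            continue
        stats.append(metric(y_true[idx], y_pred[idx]))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_pred), lower, upper


# Toy test-set labels and predictions (made up, not from the study).
y_true = np.array([1] * 20 + [0] * 60)
y_pred = y_true.copy()
y_pred[:2] = 0     # two missed recurrences
y_pred[20:23] = 1  # three false alarms

point, lower, upper = bootstrap_ci(y_true, y_pred, balanced_accuracy)
print(f"balanced accuracy = {point:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```

With only 20 positive cases in the toy data, the interval is dominated by the variability of sensitivity, which is consistent with the wider sensitivity interval (0.84-1.00) reported above compared with the AUC interval.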

Conclusions: Dimensionality reduction via PCA and t-SVD significantly improved model performance, particularly for LR, SVM, FNN, RF and KNN classifiers. The PCA-based LR pipeline showed the best generalizability, supporting its potential integration into clinical decision-support tools for personalized DTC management.

Clinical trial number: Not applicable.

Source journal: BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles on the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.