Automated sparse feature selection in high-dimensional proteomics data via 1-bit compressed sensing and K-Medoids clustering.

IF 3.3 · Zone 3 (Biology) · JCR Q2 (Biochemical Research Methods)

BMC Bioinformatics Pub Date : 2025-07-01 DOI:10.1186/s12859-025-06193-2

FuDong Wen, Yue Su, Dan Liu, YuPeng Wang, MeiNa Liu

BMC Bioinformatics 26(1):165 (Journal Article). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12220089/pdf/

Citations: 0

Abstract

Background: High-dimensional proteomics data present significant challenges in biomarker discovery due to technical noise, feature redundancy, and multicollinearity. Current feature selection methods, including filter, wrapper, and embedded approaches, struggle with stability, sparsity, and computational efficiency. To address these limitations, we propose Soft-Thresholded Compressed Sensing (ST-CS), a hybrid framework integrating 1-bit compressed sensing with K-Medoids clustering. Unlike conventional methods relying on manual thresholds, ST-CS automates feature selection by dynamically partitioning coefficient magnitudes into discriminative biomarkers and noise.
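The two-stage idea described above (estimate sparse coefficients from 1-bit class labels, then let a two-cluster K-Medoids split the coefficient magnitudes into "biomarker" and "noise" groups) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract gives no optimization details, so the sketch swaps in a simple correlation-based 1-bit estimator (no soft-thresholding step), and the function names `one_bit_estimate`, `two_medoids_1d`, and `select_features`, as well as the simulated data, are hypothetical.

```python
import numpy as np

def one_bit_estimate(X, y):
    """Assumed stand-in estimator: correlate standardized features with
    the +/-1 labels, then rescale the coefficient vector to unit length."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    w = Xs.T @ y / len(y)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def two_medoids_1d(values, n_iter=100):
    """K-Medoids with k=2 on scalars, alternating assignment and update."""
    values = np.asarray(values, dtype=float)
    medoids = np.array([values.min(), values.max()])  # start at the extremes
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - medoids[None, :]), axis=1)
        new = medoids.copy()
        for k in range(2):
            members = values[labels == k]
            if members.size:
                # The medoid is the member with the smallest total
                # absolute distance to the other members of its cluster.
                cost = np.abs(members[:, None] - members[None, :]).sum(axis=1)
                new[k] = members[np.argmin(cost)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    labels = np.argmin(np.abs(values[:, None] - medoids[None, :]), axis=1)
    return labels, medoids

def select_features(X, y):
    """Keep the features whose |coefficient| falls in the higher-medoid cluster."""
    w = one_bit_estimate(X, y)
    labels, medoids = two_medoids_1d(np.abs(w))
    signal_cluster = int(np.argmax(medoids))
    return np.flatnonzero(labels == signal_cluster), w

# Simulated example (assumption): 3 informative features out of 50.
rng = np.random.default_rng(0)
n, p = 1000, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 3.0
y = np.sign(X @ beta + 0.5 * rng.standard_normal(n))
selected, w = select_features(X, y)
print("selected feature indices:", selected.tolist())
```

The point of the clustering step is that no cutoff on the coefficient magnitudes is chosen by hand: the boundary between the two groups falls out of the medoid assignment.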

Results: Evaluations on simulated and real-world proteomic datasets demonstrated ST-CS's superiority in feature selection capability and classification performance. In simulations, ST-CS achieved feature selection robustness with balanced sensitivity (> 80%) and specificity (> 99.8%), reducing false discovery rates (FDR) by 20-50% compared to Hard-Thresholded Compressed Sensing (HT-CS). Additionally, it attained superior F1 scores and Matthews Correlation Coefficients (MCC), outperforming HT-CS, LASSO, and SPLSDA in identifying true biomarkers while suppressing noise. For classification performance, ST-CS surpassed all methods in the area under the receiver operating characteristic curve (AUC) across varying noise levels while maintaining sparsity. Applied to Clinical Proteomic Tumor Analysis Consortium (CPTAC) datasets, ST-CS matched HT-CS's classification accuracy (AUC = 97.47% for intrahepatic cholangiocarcinoma) but with 57% fewer selected features (37 vs. 86), demonstrating its dual strength in precision biomarker discovery and predictive accuracy. For glioblastoma data, ST-CS achieved higher AUC (72.71%) than HT-CS (72.15%), LASSO (67.80%), and SPLSDA (71.38%) while retaining a parsimonious feature set (30 vs. 58 features for HT-CS). In ovarian serous cystadenocarcinoma, ST-CS further demonstrated its adaptability, attaining superior AUC (75.86%) over HT-CS (75.61%), LASSO (61.00%), and SPLSDA (70.75%) with only 24 ± 5 selected biomarkers. These results highlight ST-CS's ability to rigorously automate feature selection while balancing classification efficacy, interpretability, and scalability for translational proteomics.
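For readers less familiar with the feature-selection metrics quoted above, the sketch below computes sensitivity, specificity, FDR, F1, and MCC from the standard confusion-count definitions, and reproduces the "57% fewer features" arithmetic from the 37 vs. 86 counts. The function names and the example counts passed to `selection_metrics` are illustrative assumptions, not values from the paper.

```python
import math

def selection_metrics(tp, fp, fn, tn):
    """Standard metrics for a selected feature set versus the true biomarkers.
    Assumes non-degenerate counts (at least one selected and one true feature)."""
    sensitivity = tp / (tp + fn)   # share of true biomarkers recovered
    specificity = tn / (tn + fp)   # share of noise features rejected
    precision = tp / (tp + fp)
    fdr = fp / (tp + fp)           # false discovery rate = 1 - precision
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "fdr": fdr, "f1": f1, "mcc": mcc}

def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

# Illustrative counts (assumption): 10 true biomarkers among 100 features.
metrics = selection_metrics(tp=8, fp=2, fn=2, tn=88)
print(metrics)

# Feature counts reported for intrahepatic cholangiocarcinoma: 37 vs. 86.
print(f"{relative_reduction(86, 37):.0%} fewer features")
```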


Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70
Self-citation rate: 3.30%
Articles published: 506
Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.