SurvMarker: an R package for identifying survival-associated molecular features using PCA-based weighted scores.

Impact factor 3.3; Region 3 (Biology); JCR Q2 (Biochemical Research Methods)

BMC Bioinformatics. Pub Date: 2026-05-08. DOI: 10.1186/s12859-026-06461-9

Dona Hasini Gammune, Tongjun Gu



Abstract

Background: Identification of prognostic molecular features from high-dimensional molecular data is central to biomarker discovery in cancer and other complex diseases. Principal component analysis (PCA) is widely used for dimensionality reduction in survival studies, yet selecting individual features from principal components (PCs) remains challenging and often relies on arbitrary thresholds. To address this limitation, we developed SurvMarker, an R package that prioritizes survival-associated molecular features using a PCA-based scoring framework.

Results: SurvMarker applies PCA to normalized molecular data, jointly evaluates PCs using multivariable Cox proportional hazards models, and ranks features by aggregating absolute loadings across survival-associated PCs. Feature significance is assessed using an empirical null framework with false discovery rate control. In both synthetic global-null and permutation-based null simulations, SurvMarker showed comparable or better false positive control than LASSO Cox, Elastic Net Cox, and Partial Least Squares Cox, particularly in small-n, large-p settings, while maintaining well-calibrated null p-value distributions. In the TCGA-LAML cohort, SurvMarker achieved the best predictive performance among these methods for gene expression data, with a C-index of 0.78 and an overall time-dependent AUC of 0.882, and showed similar applicability to miRNA expression data. Compared with sparse PCA-based and fixed per-PC threshold approaches, SurvMarker also achieved better predictive performance and yielded more compact, stable feature sets across different PC settings.

Conclusions: SurvMarker provides a robust, interpretable, and reproducible framework for identifying survival-associated molecular features from high-dimensional data. By combining survival-guided PC selection, weighted feature aggregation across PCs, and empirical null-based inference, it improves false discovery control, stability, and biological relevance, and offers a practical tool for biomarker discovery across multiple omics data types.
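The scoring and inference steps summarized in the Results can be illustrated with a minimal sketch. It is not the SurvMarker implementation: the abstract does not say how PCs are weighted or how the empirical null is built, so the per-PC `weights`, the pooled `null_scores`, and all function names below are assumptions chosen for illustration. The sketch takes loadings for PCs already selected as survival-associated, aggregates absolute loadings into one score per feature, converts scores to empirical p-values against null scores (for example, from permuted survival outcomes), and applies the Benjamini-Hochberg adjustment.

```python
def weighted_scores(loadings, weights):
    """Aggregate absolute loadings across selected PCs into one score per feature.

    loadings: one row per feature, one column per survival-associated PC.
    weights:  one non-negative weight per PC (hypothetical; the weighting
              scheme used by the package is not given in the abstract).
    """
    return [sum(w * abs(l) for w, l in zip(weights, row)) for row in loadings]


def empirical_pvalues(observed, null_scores):
    """Empirical p-value: share of null scores at least as large as the observed one.

    The +1 in numerator and denominator keeps p-values strictly above zero.
    """
    b = len(null_scores)
    return [
        (1 + sum(1 for s in null_scores if s >= obs)) / (1 + b)
        for obs in observed
    ]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data: 3 features, 2 selected PCs (all values are made up).
toy_loadings = [[0.8, -0.6], [0.1, 0.2], [-0.5, 0.0]]
toy_weights = [2.0, 1.0]
toy_null = [0.3, 0.5, 0.9, 1.1, 0.2, 0.7, 0.45, 0.6, 0.8]

scores = weighted_scores(toy_loadings, toy_weights)
pvals = empirical_pvalues(scores, toy_null)
fdr = benjamini_hochberg(pvals)
```

Using absolute loadings means a feature contributes whether it loads positively or negatively on a PC, and the weights let PCs with stronger survival association count for more. In practice the null would need many more draws than this toy example for the p-values to have useful resolution.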




Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 506

Review time: 4.3 months

Journal description: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series, which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.