BC-predict: mining of signal biomarkers and production of models for early-stage breast cancer subtyping and prognosis.

Impact factor: 3.9; JCR Q2 (Mathematical & Computational Biology)

Frontiers in Bioinformatics; Pub Date: 2025-09-18; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1644695

Sangeetha Muthamilselvan, Natarajan Vaithilingam, Ashok Palaniappan

Volume 5, article 1644695. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12488574/pdf/

Citations: 0

Abstract

Introduction: Disease heterogeneity is the hallmark of breast cancer, which is the most common female malignancy. With a disturbing increase in mortality and disease burden, there remains a need for effective early-stage theragnostic and prognostic biomarkers. In this work, we improved on BrcaDx (https://apalania.shinyapps.io/brcadx/) for cancer vs control screening and examined a cluster of adjoining learning problems in breast cancer heterogeneity: (i) identification of metastatic cancers; (ii) molecular subtyping (TNBC, HER2, or luminal); and (iii) histological subtyping (invasive ductal or invasive lobular).

Methods: We analyzed the transcriptomic profiles of breast cancer patients from public-domain databases such as TCGA, using stage-encoded, problem-specific statistical models of gene expression, and identified stage-salient and progression-significant genes. Using a consensus approach, we identified potential machine learning features and considered six model classes for each learning problem, with hyperparameter optimization on a training dataset and evaluation on a holdout test dataset. A nested approach enabled us to identify the best model class for each learning problem.
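The selection procedure described in the Methods (tune each model class on training data, score it on a holdout set, keep the best class) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the two toy model classes (`knn_fit`, `centroid_fit`), their hyperparameter grids, the 3-fold inner loop, and the synthetic two-feature data are all hypothetical stand-ins, since the abstract does not enumerate the six model classes or the selected features.

```python
import random
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_fit(X, y, k):
    """Hypothetical model class 1: majority vote among the k nearest rows."""
    def predict(row):
        nearest = sorted(range(len(X)), key=lambda i: sq_dist(X[i], row))[:k]
        return Counter(y[i] for i in nearest).most_common(1)[0][0]
    return predict

def centroid_fit(X, y, shrink):
    """Hypothetical model class 2: nearest centroid, shrunk toward the global mean."""
    n_feat = len(X[0])
    overall = [sum(r[j] for r in X) / len(X) for j in range(n_feat)]
    cents = {}
    for c in sorted(set(y)):
        rows = [r for r, lab in zip(X, y) if lab == c]
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]
        cents[c] = [(1 - shrink) * m + shrink * o for m, o in zip(mean, overall)]
    return lambda row: min(cents, key=lambda c: sq_dist(cents[c], row))

# Assumed grids; the paper's six model classes are not listed in the abstract.
MODEL_CLASSES = {"knn": (knn_fit, [1, 3, 5]), "centroid": (centroid_fit, [0.0, 0.5])}

def cv_score(fit, param, X, y, folds=3):
    """Inner loop: mean balanced accuracy over interleaved folds of the training set."""
    scores = []
    for f in range(folds):
        tr = [i for i in range(len(X)) if i % folds != f]
        va = [i for i in range(len(X)) if i % folds == f]
        model = fit([X[i] for i in tr], [y[i] for i in tr], param)
        scores.append(balanced_accuracy([y[i] for i in va], [model(X[i]) for i in va]))
    return sum(scores) / folds

def select_model_class(X_train, y_train, X_test, y_test):
    """Outer loop: tune each class on training data only, then score it on the holdout."""
    results = {}
    for name, (fit, grid) in MODEL_CLASSES.items():
        best = max(grid, key=lambda p: cv_score(fit, p, X_train, y_train))
        model = fit(X_train, y_train, best)
        results[name] = (best, balanced_accuracy(y_test, [model(r) for r in X_test]))
    winner = max(results, key=lambda n: results[n][1])
    return winner, results

# Synthetic two-class data standing in for expression features.
rng = random.Random(0)

def make(n):
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(3.0 * label, 1.0)])
        y.append(label)
    return X, y

X_train, y_train = make(60)
X_test, y_test = make(30)
winner, results = select_model_class(X_train, y_train, X_test, y_test)
print(winner, results)
```

Hyperparameters are chosen without touching the holdout rows, so the holdout score of each class is not contaminated by tuning. Picking the winning class by holdout score is itself a selection step, which is why a further external validation set, as reported in the Results, is useful.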

Results: External validation of the best models yielded balanced accuracies of 97.42% for cancer vs normal; 88.22% for metastatic vs non-metastatic; 88.79% for ternary molecular subtyping; and an ensemble accuracy of 94.23% for histological subtyping. The model for molecular subtyping was validated on a 26-sample TNBC-only out-of-distribution cohort, yielding 25 correct predictions. We performed a late integration of multi-omics datasets by validating the feature space used in each problem with miRNA profiles, methylation profiles, and commercial breast cancer panels.
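To put the out-of-distribution result in context, 25 correct predictions out of 26 corresponds to an accuracy of about 96.15%. Because the cohort is small, the uncertainty around that figure is wide. The sketch below computes a Wilson score interval as one common way to express it; the interval is an illustration added here and is not reported in the abstract, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

correct, total = 25, 26  # counts reported in the abstract
accuracy = correct / total
low, high = wilson_interval(correct, total)

print(f"accuracy = {accuracy:.2%}")
print(f"approx. 95% Wilson interval = [{low:.1%}, {high:.1%}]")
```

On a single-class cohort such as this one, accuracy coincides with the recall for that class (here, TNBC), so the figure says nothing about how the model handles HER2 or luminal samples.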

Discussion: Pending prospective studies, we have translated the models into BC-Predict, which forks the best models developed for each problem into a unified interface and provides a complete readout for input instances of expression data, including uncertainty estimates. BC-Predict is freely available for non-commercial purposes at: https://apalania.shinyapps.io/BC-Predict.




Source journal: Frontiers in Bioinformatics

CiteScore: 2.60

Self-citation rate: 0.00%

Articles published