Interpretable Machine Learning for Proteomics-Based Subtyping and Tumor Mutational Burden Prediction in Endometrial Cancer.

IF 2.5 · Zone 4 (Biology) · JCR Q3 · BIOCHEMICAL RESEARCH METHODS

PROTEOMICS – Clinical Applications · Pub Date: 2025-09-08 · DOI: 10.1002/prca.70024

Thi-My-Trang Luong, Xuan Lam Bui, Chii-Ruey Tzeng, Nguyen Quoc Khanh Le

Citations: 0

Abstract

Background: Endometrial carcinoma (EC) represents a significant clinical challenge because of its pronounced molecular heterogeneity, which directly influences prognosis and therapeutic response. Accurate classification of molecular subtypes (CNV-high, CNV-low, MSI-H, POLE) and precise assessment of tumor mutational burden (TMB) are crucial for guiding personalized therapeutic interventions. Integrating proteomics data with advanced machine learning (ML) techniques offers a promising strategy for achieving precise, clinically actionable classification and biomarker discovery in EC.

Materials and methods: Using proteomic data from 95 EC patients (83 endometrioid, 12 serous) sourced from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), we developed an ML pipeline integrating proteomic feature selection (Lasso-penalized logistic regression), classification modeling, and interpretability analysis. The dataset was divided into training (70%) and test (30%) sets, with the synthetic minority oversampling technique (SMOTE) applied to address class imbalance. Logistic regression models were trained for molecular subtype classification and TMB prediction, and model performance was evaluated using accuracy, AUC, precision, recall, and F1-score. Model interpretability was enhanced using explainable AI (XAI) techniques: SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME).
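The steps named in the methods (L1-penalized feature selection, a 70/30 split, minority oversampling, logistic regression, and per-sample attribution) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the data are synthetic (not CPTAC), the target is binary rather than the four-subtype problem, the hyperparameters (`C=0.1`, `k=5`) are arbitrary, the order of steps (split first, then selection and oversampling on the training set only) is one reasonable choice the abstract does not confirm, `smote_like_oversample` is a simplified hand-written interpolation scheme rather than a reference SMOTE implementation, and the attribution uses the closed-form value for a linear model under feature independence instead of the SHAP library.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def smote_like_oversample(X, y, k=5, seed=0):
    """Simplified SMOTE-style balancing (hypothetical helper): create new
    minority samples on the segment between a sample and one of its k
    nearest same-class neighbours."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_new = counts.max() - count
        if n_new == 0 or count < 2:
            continue
        X_cls = X[y == cls]
        # Pairwise distances within the class; a point is not its own neighbour.
        dists = np.linalg.norm(X_cls[:, None, :] - X_cls[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        neighbours = np.argsort(dists, axis=1)[:, : min(k, count - 1)]
        base = rng.integers(0, count, size=n_new)
        col = rng.integers(0, neighbours.shape[1], size=n_new)
        picked = neighbours[base, col]
        gap = rng.random((n_new, 1))
        X_parts.append(X_cls[base] + gap * (X_cls[picked] - X_cls[base]))
        y_parts.append(np.full(n_new, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


def run_pipeline(seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for a protein abundance matrix (NOT CPTAC data):
    # 95 samples, 200 features, 24 positives; the first 5 features carry signal.
    y = np.array([1] * 24 + [0] * 71)
    X = rng.normal(size=(95, 200))
    X[y == 1, :5] += 2.0

    # Stratified 70/30 split; scaling is fitted on the training set only.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    # L1-penalized logistic regression: features with non-zero weight are kept.
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    lasso.fit(X_tr, y_tr)
    selected = np.flatnonzero(lasso.coef_[0])

    # Oversample the training set only, then fit the final classifier.
    X_bal, y_bal = smote_like_oversample(X_tr[:, selected], y_tr, seed=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    X_eval = X_te[:, selected]
    pred = clf.predict(X_eval)
    metrics = {
        "accuracy": accuracy_score(y_te, pred),
        "auc": roc_auc_score(y_te, clf.predict_proba(X_eval)[:, 1]),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
    }

    # Linear-model attribution in log-odds space: w_j * (x_j - mean_j).
    background = X_bal.mean(axis=0)
    contributions = clf.coef_[0] * (X_eval - background)
    base_value = clf.intercept_[0] + clf.coef_[0] @ background
    return {
        "n_test": X_eval.shape[0],
        "selected": selected,
        "y_balanced": y_bal,
        "metrics": metrics,
        "contributions": contributions,
        "base_value": base_value,
        "log_odds": clf.decision_function(X_eval),
    }


if __name__ == "__main__":
    result = run_pipeline()
    print(len(result["selected"]), "features kept;", result["metrics"])
```

For each test sample, the base value plus the sum of its per-feature contributions equals the model's log-odds output. That additivity is what lets an attribution of this kind be read as a decomposition of a single prediction.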

Results: Feature selection reduced the proteomic dataset from 11,000 proteins to eight key proteins. The proteomics-based ML model demonstrated robust predictive performance, classifying EC molecular subtypes (accuracy: 82.8%; AUC: 0.990) and distinguishing high-TMB (≥10 mutations/Mb) from low-TMB (<10 mutations/Mb) cases (accuracy: 89.7%; AUC: 0.984). SHAP analysis highlighted clinically recognized biomarkers (MLH1, PMS2, STAT1) and identified novel protein candidates (MTHFD2, MAST4, RPL22L1, MX2, SEC16A). LIME analysis provided individualized prediction interpretations, clarifying each protein biomarker's influence on model decisions.
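To put the reported accuracies in context, the short sketch below applies the stated TMB cutoff and checks which counts of correct predictions would reproduce the percentages. It rests on an assumption the abstract does not state: that the 30% test split of 95 patients (28.5) was rounded up to 29 samples. The function names are illustrative only.

```python
import math


def tmb_label(mutations_per_mb, threshold=10.0):
    """Binarize TMB with the cutoff given in the abstract (>= 10 mutations/Mb)."""
    return "high" if mutations_per_mb >= threshold else "low"


def matching_correct_counts(n_test, reported_pct):
    """Counts of correct predictions whose accuracy rounds to reported_pct."""
    return [k for k in range(n_test + 1)
            if abs(round(100 * k / n_test, 1) - reported_pct) < 1e-9]


# Assumed test-set size: 30% of 95 is 28.5, rounded up here to 29.
n_test = math.ceil(95 * 30 / 100)

print(tmb_label(10.0), tmb_label(9.9))        # high low
print(matching_correct_counts(n_test, 82.8))  # [24]
print(matching_correct_counts(n_test, 89.7))  # [26]
```

Under that assumption, the subtype and TMB accuracies correspond to 24 and 26 correct predictions out of 29, so a single test sample shifts accuracy by about 3.4 percentage points.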

Conclusion: Our proteomics-driven ML approach demonstrates high accuracy and interpretability in EC subtype classification and TMB prediction. By identifying validated and novel biomarkers, this strategy provides essential biological insights and a strong foundation for the future development of non-invasive diagnostics, personalized treatments, and precision medicine in EC.

Source journal

PROTEOMICS – Clinical Applications (Medicine: Biochemical Research Methods)

CiteScore: 5.20

Self-citation rate: 5.00%

Review time: 1 month

Journal description: PROTEOMICS - Clinical Applications has developed into a key source of information in the field of applying proteomics to the study of human disease and translation to the clinic. With 12 issues per year, the journal will publish papers in all relevant areas including:

- basic proteomic research designed to further understand the molecular mechanisms underlying dysfunction in human disease
- the results of proteomic studies dedicated to the discovery and validation of diagnostic and prognostic disease biomarkers
- the use of proteomics for the discovery of novel drug targets
- the application of proteomics in the drug development pipeline
- the use of proteomics as a component of clinical trials