Data-Driven Drug Discovery Optimization for Breast Cancer Using Interpretable Machine Learning Models.

IF 1.2 · Zone 4 (Multidisciplinary Journals) · JCR Q3 · MULTIDISCIPLINARY SCIENCES

Jove-Journal of Visualized Experiments Pub Date : 2025-09-12 DOI:10.3791/68705

Dyuti Banerjee, Sivaneasan Bala Krishnan, Kamal Upreti, Sumegh Shrikant Tharewal, Uma Shankar, Pravin Kshirsagar, Manoj Kumar

Volume 223 · Journal Article · Not open access

Citations: 0

Abstract

Breast cancer remains one of the most prevalent malignancies worldwide, posing significant therapeutic challenges due to tumor heterogeneity and drug resistance. This study presents a reproducible, data-driven machine learning protocol for predicting drug sensitivity in breast cancer cell lines, with the dual objective of identifying potent single agents and synergistic drug combinations. Using curated datasets from the Genomics of Drug Sensitivity in Cancer (GDSC), two predictive approaches were implemented: a standalone XGBoost regressor and a hybrid Autoencoder-XGBoost pipeline. Preprocessing included label encoding, one-hot encoding, Z-score standardization, missing value imputation, and dimensionality reduction via PCA. Model evaluation demonstrated that XGBoost achieved superior performance (MSE = 1.3789, R² = 0.8145) compared to the hybrid model (MSE = 4.0322, R² = 0.4577). Interpretability was addressed using SHapley Additive exPlanations (SHAP), which identified TARGET_PATHWAY, DRUG_ID, TARGET, and CELL_LINE_NAME as key predictive features, aligning with established pharmacological mechanisms. Predicted synergy scores, derived from combining model outputs with DrugComb and SynergyDB data, highlighted promising drug pairs such as Bortezomib + Romidepsin and Paclitaxel + Bortezomib. These findings were further supported by PCA-based pharmacological clustering, revealing biologically relevant groupings of drugs with similar mechanisms of action. The proposed protocol provides a transparent and adaptable framework for precision oncology research, enabling both predictive accuracy and biological interpretability. By integrating rigorous preprocessing, model validation, explainability, and drug synergy analysis, this workflow offers a scalable foundation for translational drug discovery and repurposing in breast cancer treatment.
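The two evaluation metrics reported in the abstract, MSE and R², can be illustrated with a minimal Python sketch. It assumes the standard definitions (mean of squared residuals, and one minus the ratio of residual to total sum of squares); the abstract does not state how the authors computed them. The function names and the toy values are hypothetical and are not taken from the study or from GDSC.

```python
# Minimal sketch of the standard MSE and R^2 definitions (assumed, not
# confirmed by the source). All names and numbers here are illustrative.

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return sum(r * r for r in residuals) / len(residuals)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted drug-response values (toy data).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(f"MSE = {mse(observed, predicted):.4f}")
print(f"R2  = {r2(observed, predicted):.4f}")
```

Under these definitions a lower MSE and an R² closer to 1 indicate a better fit, which is the sense in which the standalone XGBoost model (MSE = 1.3789, R² = 0.8145) outperforms the hybrid model (MSE = 4.0322, R² = 0.4577).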


Source journal

Jove-Journal of Visualized Experiments (MULTIDISCIPLINARY SCIENCES)

CiteScore: 2.10
Self-citation rate: 0.00%
Articles published: 992

Journal description: JoVE, the Journal of Visualized Experiments, is the world's first peer-reviewed scientific video journal. Established in 2006, JoVE is devoted to publishing scientific research in a visual format to help researchers overcome two of the biggest challenges facing the scientific research community today: poor reproducibility and the time- and labor-intensive nature of learning new experimental techniques.