Firas Alghanim, Ibrahim Al-Hurani, H. Qattous, Abdullah Al-Refai, Osamah Batiha, A. Alkhateeb, Salama Ikki
Algorithms, published 2023-12-28. DOI: 10.3390/a17010013 (https://doi.org/10.3390/a17010013)
Machine Learning Model for Multiomics Biomarkers Identification for Menopause Status in Breast Cancer
Identifying menopause-related breast cancer biomarkers is crucial for improving diagnosis, prognosis, and personalized treatment at that stage of a patient's life. In this paper, we present a framework for extracting multiomics biomarkers specifically related to breast cancer incidence before and after menopause. Our approach integrates DNA methylation, gene expression, and copy number alteration data through a pipeline comprising data preprocessing, class-imbalance handling, dimensionality reduction, and classification. The framework starts with MutSigCV for data preprocessing and quality control. The Synthetic Minority Over-sampling Technique (SMOTE) is then applied to address class imbalance. Next, Principal Component Analysis (PCA) transforms the DNA methylation, gene expression, and copy number alteration data into a latent space, discarding irrelevant variation while retaining relevant information. Finally, a classification model is built on the unified representation of the transformed multiomics data. The framework contributes to understanding the complex interplay between menopause and breast cancer and may thereby support more precise diagnostic and therapeutic strategies in the future. Shapley-based explainable artificial intelligence, applied to an XGBoost regressor, showed the predictive power of the selected gene expression features for menopause status; the potential biomarkers included RUNX1, PTEN, MAP3K1, and CDH1. These findings are supported by the literature.
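The order of the pipeline stages described in the abstract (oversample the minority class, reduce each omics block with PCA, classify the concatenated latent features) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the data are synthetic, the block sizes and the number of components are arbitrary, `smote_like`, `pca_fit`, `pca_transform`, `split_blocks`, and `to_latent` are hypothetical helpers, and the nearest-centroid classifier is a stand-in because the abstract does not name the classifier. The oversampler is a simplified SMOTE-style interpolation, not a reference implementation. MutSigCV preprocessing and the Shapley analysis on the XGBoost regressor are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical sizes, not the paper's cohort).
n_post, n_pre = 48, 12                       # 1 = post-menopause, 0 = pre-menopause
y = np.array([1] * n_post + [0] * n_pre)
block_sizes = {"methylation": 30, "expression": 40, "cna": 20}

def make_block(n_features):
    X = rng.normal(size=(len(y), n_features))
    X[y == 0, :5] += 2.0                     # class signal in the first 5 features
    return X

X = np.hstack([make_block(p) for p in block_sizes.values()])

# Hold out a stratified test set BEFORE resampling so that no synthetic
# sample is derived from, or leaks into, the test data.
test_mask = np.zeros(len(y), dtype=bool)
test_mask[:12] = True                        # 12 post-menopause samples
test_mask[n_post:n_post + 3] = True          # 3 pre-menopause samples
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

def smote_like(X, y, minority, k=3):
    """Add synthetic minority samples by interpolating between a minority
    sample and one of its k nearest minority neighbours until balanced."""
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synth = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return np.vstack([X, synth]), np.concatenate([y, np.full(n_new, minority)])

def pca_fit(X, n_components):
    """Return the column means and the top principal axes (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    return (X - mean) @ components.T

def split_blocks(X):
    """Split the concatenated matrix back into one matrix per omics type."""
    edges = np.cumsum([0] + list(block_sizes.values()))
    return [X[:, a:b] for a, b in zip(edges[:-1], edges[1:])]

# 1) Balance the training classes on the concatenated features, which keeps
#    the three omics blocks of each synthetic sample aligned.
X_bal, y_bal = smote_like(X_train, y_train, minority=0)

# 2) Fit one PCA per omics block on the balanced training data.
n_components = 5
models = [pca_fit(B, n_components) for B in split_blocks(X_bal)]

def to_latent(X):
    """Project each block with its own PCA and concatenate the results."""
    parts = [pca_transform(B, m, c) for B, (m, c) in zip(split_blocks(X), models)]
    return np.hstack(parts)

Z_train, Z_test = to_latent(X_bal), to_latent(X_test)

# 3) Stand-in classifier: assign each test sample to the nearest class centroid.
centroids = np.stack([Z_train[y_bal == c].mean(axis=0) for c in (0, 1)])
dist = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
pred = np.argmin(dist, axis=1)
acc = float((pred == y_test).mean())
print(f"latent shape: {Z_train.shape}, held-out accuracy: {acc:.2f}")
```

Fitting the PCA on the training portion only and holding out the test set before oversampling is a design choice of this sketch; the abstract does not say how the evaluation split was made.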