Comparative investigation of lung adenocarcinoma and squamous cell carcinoma transcriptome to reveal potential candidate biomarkers: An explainable AI approach
Authors: Ankur Datta, George Priya Doss. C
Journal: Computational Biology and Chemistry, volume 115, Article 108333 (impact factor 2.6; JCR Q2, Biology)
Published: 2024-12-27
DOI: 10.1016/j.compbiolchem.2024.108333
URL: https://www.sciencedirect.com/science/article/pii/S1476927124003219
Citations: 0
Abstract
Patients with non-small cell lung cancer (NSCLC) present with a variety of clinical symptoms, such as dyspnea and chest pain, which complicates accurate diagnosis. NSCLC includes subtypes distinguished by histological characteristics, specifically lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). This study aims to identify and compare abnormal gene expression patterns in LUAD and LUSC samples relative to adjacent healthy tissues using an explainable artificial intelligence (XAI) framework. The LASSO algorithm was employed to select the top gene features in the LUAD and LUSC datasets, yielding 35 and 33 genes, respectively. An ensemble-based extreme gradient boosting (XGBoost) machine learning (ML) model was trained and interpreted using SHapley Additive exPlanations (SHAP), and the top features underwent biological annotation through survival and functional enrichment analyses. The SHAP module of the XAI framework addresses the opaque nature of ML models. Performance was evaluated with metrics such as average accuracy and the Matthews correlation coefficient. The XGBoost model achieved an average accuracy of 99.1% for LUAD and 98.6% for LUSC. The SFTPC gene emerged as the most significant feature in both NSCLC subtypes. For LUAD, genes such as STX11, CLEC3B, EMP2, and LYVE1 contributed strongly to the model's predictions under the XAI-SHAP framework, whereas GKN2, OGN, SLC39A8, and MMRN1 were identified for LUSC. Survival analysis and functional validation of these genes highlighted the physiological functions that are dysregulated in the NSCLC subtypes. These genes have the potential to enhance current medical diagnostics and therapeutics.
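The abstract reports both average accuracy and the Matthews correlation coefficient (MCC). The minimal sketch below shows how the two metrics are computed from a binary confusion matrix and why they can diverge when classes are imbalanced. The labels are hypothetical values invented for illustration, not data or results from the study, and the function names are assumptions, not the authors' code.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (TP, TN, FP, FN) for binary labels, with 1 = tumor and 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical, imbalanced example: 90 tumor samples and 10 normal samples.
y_true = [1] * 90 + [0] * 10
# All tumors are predicted correctly; half of the normals are misclassified.
y_pred = [1] * 90 + [1] * 5 + [0] * 5

counts = confusion_counts(y_true, y_pred)
print("TP, TN, FP, FN:", counts)
print("accuracy:", round(accuracy(*counts), 3))
print("MCC:", round(mcc(*counts), 3))
```

With these invented labels the accuracy is 0.95 while the MCC is about 0.69, because MCC uses all four cells of the confusion matrix and penalizes the misclassified minority class. This is one common reason to report MCC next to accuracy.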
About the journal:
Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High-quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high-quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered.
Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, complementary techniques such as molecular dynamics simulations with detailed free energy calculations should be used to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered.
Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.