Comparative investigation of lung adenocarcinoma and squamous cell carcinoma transcriptome to reveal potential candidate biomarkers: An explainable AI approach

IF 2.6 | Zone 4 (Biology) | JCR Q2 (Biology)
Ankur Datta, George Priya Doss. C
Computational Biology and Chemistry, vol. 115, Article 108333. Published 2024-12-27. DOI: 10.1016/j.compbiolchem.2024.108333
https://www.sciencedirect.com/science/article/pii/S1476927124003219

Abstract

Patients with non-small cell lung cancer (NSCLC) present with a variety of clinical symptoms, such as dyspnea and chest pain, which complicates accurate diagnosis. NSCLC comprises histologically distinct subtypes, notably lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). This study compares LUAD and LUSC samples with adjacent healthy tissue to identify abnormal gene expression patterns, using an explainable artificial intelligence (XAI) framework. The LASSO algorithm was used to select the top gene features in the LUAD and LUSC datasets, yielding 35 and 33 genes, respectively. An ensemble-based extreme gradient boosting (XGBoost) machine learning (ML) model was then trained and interpreted with SHapley Additive exPlanations (SHAP), which addresses the opaque nature of ML models; the top features were annotated biologically through survival and functional enrichment analyses. Performance was evaluated with metrics such as average accuracy and the Matthews correlation coefficient. The XGBoost model achieved an average accuracy of 99.1% for LUAD and 98.6% for LUSC. SFTPC emerged as the most significant feature in both NSCLC subtypes. For LUAD, STX11, CLEC3B, EMP2, and LYVE1 were among the most influential features in the SHAP analysis; for LUSC, the corresponding genes were GKN2, OGN, SLC39A8, and MMRN1. Survival analysis and functional validation of these genes highlighted physiological functions that are dysregulated in the NSCLC subtypes. The identified genes have the potential to improve current diagnostics and therapeutics.
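
The abstract names average accuracy and the Matthews correlation coefficient (MCC) as evaluation metrics without giving their definitions. As a minimal sketch, not the authors' code, both can be computed from binary confusion counts. The labels below (1 = tumor, 0 = adjacent normal) and the helper names are hypothetical and chosen only for illustration; they are not data from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels: 1 = tumor, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels (an assumption for illustration, not study data):
# 8 tumor samples and 2 normal samples, with one normal sample misclassified.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 8 + [1, 0]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", round(accuracy(*counts), 3))
print("MCC:", round(mcc(*counts), 3))
```

On this toy example the accuracy is 0.9 while the MCC is about 0.667, which illustrates why MCC is often reported alongside accuracy when one class is much smaller than the other.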
Source journal
Computational Biology and Chemistry (Biology; Computer Science, Interdisciplinary Applications)
CiteScore: 6.10
Self-citation rate: 3.20%
Articles published: 142
Review time: 24 days
About the journal: Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High-quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high-quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered. Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, molecular dynamics simulations along with detailed free energy calculations, for example, should be used as complementary techniques to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered. Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.