Application of Generative Adversarial Networks on RNASeq data to uncover COVID-19 severity biomarkers

Advances in biomarker sciences and technology, Pub Date: 2025-01-01, DOI: 10.1016/j.abst.2025.01.002

Yvette K. Kalimumbalo, Rosaline W. Macharia, Peter W. Wagacha

Volume 7, pages 44-58. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S254310642500002X


Abstract

Background

The COVID-19 pandemic has highlighted the need for reliable biomarkers to predict disease severity and guide treatment strategies. However, machine learning analysis of RNASeq data for biomarker discovery is constrained by limited sample sizes, primarily because of cost and privacy considerations. In this study, we applied Generative Adversarial Networks (GANs) to RNASeq data to identify biomarkers associated with COVID-19 severity.

Methods

RNASeq data from COVID-19 patients, along with severity metadata, were collected from the GEO database. Differential expression analysis was conducted, and GAN models were trained to augment the original dataset. The augmentation improved the robustness and accuracy of the subsequent machine learning models used for biomarker discovery. Recursive Feature Elimination with Cross-Validation (RFECV), applied to the cGAN- and cWGAN-augmented datasets, identified the key biomarkers.
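The feature-selection step described above can be illustrated with a minimal sketch using scikit-learn's `RFECV`. Everything specific here is an assumption made for illustration, not the authors' configuration: the data are synthetic (generated with `make_classification` as a stand-in for an augmented expression matrix), and the choice of a Random Forest as the ranking estimator, the fold count, the step size, and the ROC AUC scoring are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for an (augmented) expression matrix: rows are samples,
# columns are genes, y is a binary severity label. This is NOT the study's data.
X, y = make_classification(
    n_samples=120,
    n_features=30,
    n_informative=5,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

# RFECV repeatedly drops the least important features (5 per round here) and
# uses cross-validation to pick the feature count with the best mean score.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=5,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
    min_features_to_select=5,
)
selector.fit(X, y)

# Column indices of the features that survived elimination.
selected = np.flatnonzero(selector.support_)
print("features kept:", selector.n_features_)
print("selected column indices:", selected.tolist())
```

With real data, the column indices would map back to gene identifiers, and the surviving features would be candidate biomarkers to examine further.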

Results

Several key biomarkers significantly associated with disease severity were identified. Gene Ontology enrichment analysis revealed upregulation of neutrophil degranulation and downregulation of T-cell activity, consistent with previous findings. In ROC analysis, a Random Forest model using the five most important biomarkers (CCDC65, ZNF239, OTUD7A, CEP126, and TCTN2) predicted disease severity with an AUC of 0.98 and an accuracy of 0.94. These genes are associated with processes such as cilium assembly, IFN activation, and NF-κB pathway suppression.
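For readers less familiar with the two reported metrics, the following is a small worked sketch of how AUC and accuracy are computed from a classifier's predicted scores. The labels and scores are toy values invented for illustration and have no relation to the study's data; the 0.5 decision threshold is likewise an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


# Toy example: 1 = severe, 0 = non-severe (hypothetical values).
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

print("AUC:", roc_auc(labels, scores))        # 15 of 16 pairs ranked correctly
print("Accuracy:", accuracy(labels, scores))  # 6 of 8 predictions correct
```

AUC depends only on how the scores rank positives against negatives, whereas accuracy depends on the chosen threshold, which is why the two figures can differ for the same model.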

Conclusions

Our results demonstrate that GANs can effectively augment RNASeq data, yielding findings that are consistent with known mechanisms and providing new insights into the transcriptional response in severe COVID-19. Further experimental validation is needed to confirm the applicability of these biomarkers in diverse populations.

