An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification

arXiv - QuanBio - Genomics, Pub Date: 2024-05-16, DOI: arxiv-2405.09756

Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki



Abstract

In the ongoing effort to improve medical diagnostics, the integration of state-of-the-art machine learning methodologies has emerged as a promising research area. In molecular biology, there has been an explosion of data generated from multi-omics sequencing. Modern sequencing equipment can provide a large number of complex measurements per experiment. Traditional statistical methods therefore struggle with such high-dimensional data. However, most of the information contained in these datasets is redundant or irrelevant and can be reduced to significantly fewer variables without losing much information. Dimensionality reduction techniques are mathematical procedures that allow for this reduction; they have largely been developed in the statistics and machine learning disciplines. The other challenge in medical datasets is an imbalanced number of samples across classes, which leads to biased results in machine learning models. This study tackles these challenges with a neural network that incorporates an autoencoder to extract the latent space of the features and a Generative Adversarial Network (GAN) to generate synthetic samples. The latent space is the reduced-dimensional space that captures the meaningful features of the original data. Our model starts with feature selection to select the discriminative features before feeding them to the neural network. Then, the model predicts the cancer outcome for different datasets. The proposed model outperformed other existing models, achieving an accuracy of 95.09% on the bladder cancer dataset and 88.82% on the breast cancer dataset.
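The abstract's point that class imbalance biases machine learning results can be illustrated with a minimal sketch. The class counts below (95 versus 5) are made up for illustration and are not taken from the paper's datasets, and the helper names `accuracy` and `balanced_accuracy` are hypothetical. The sketch only shows why plain accuracy can look strong on imbalanced labels even when the minority class is never detected; it does not reproduce the authors' model.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 95 samples of class 0 and 5 of class 1 (made-up counts).
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the most frequent class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

print(f"accuracy:          {accuracy(y_true, y_pred):.2f}")           # 0.95
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.2f}")  # 0.50
```

Under these assumed counts, the always-majority predictor scores 0.95 accuracy while missing every minority sample, which is the kind of bias that generating synthetic minority samples is meant to reduce.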


