An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification
Ibrahim Al-Hurani, Abedalrhman Alkhateeb, Salama Ikki
arXiv - QuanBio - Genomics, published 2024-05-16, arXiv:2405.09756
Abstract
In the ongoing effort to improve medical diagnostics, the integration
of state-of-the-art machine learning methodologies has emerged as a promising
research area. In molecular biology, there has been an explosion of data
generated from multi-omics sequencing. Modern sequencing equipment can
produce a large number of complex measurements per experiment. As a result,
traditional statistical methods struggle when dealing with such
high-dimensional data. However, most of the information contained in these
datasets is redundant or irrelevant and can be reduced to
significantly fewer variables without losing much information. Dimensionality
reduction techniques are mathematical procedures that allow for this reduction;
they have largely been developed within the statistics and machine learning
disciplines. Another challenge in medical datasets is class imbalance, an unequal
number of samples across classes, which biases the results of machine
learning models. This study tackles both challenges with a neural
network that incorporates an autoencoder to extract a latent space of the features
and a Generative Adversarial Network (GAN) to generate synthetic samples. Latent
space is the reduced-dimensional space that captures the meaningful features of
the original data. Our model starts with feature selection to identify the
discriminative features before feeding them to the neural network. Then, the
model predicts the cancer outcome for different datasets. The proposed model
outperformed existing models, achieving an accuracy of 95.09% on the bladder
cancer dataset and 88.82% on the breast cancer dataset.
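
The first two stages described in the abstract, feature selection followed by an autoencoder that maps the selected features into a latent space, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract does not name its feature selector, so variance ranking is used as a placeholder; the layer sizes, learning rate, and synthetic data are arbitrary; and the helper names `select_top_variance` and `train_autoencoder` are hypothetical. The GAN stage that generates synthetic minority-class samples is omitted.

```python
import numpy as np


def select_top_variance(X, k):
    """Return sorted column indices of the k highest-variance features.

    Variance ranking is only a placeholder; the abstract does not
    say which feature-selection method the authors used.
    """
    order = np.argsort(X.var(axis=0))[::-1]
    return np.sort(order[:k])


def train_autoencoder(X, latent_dim, epochs=500, lr=0.05, seed=0):
    """Train a tanh-encoder / linear-decoder autoencoder on MSE.

    Uses full-batch gradient descent with hand-written gradients.
    Returns an encode() function and the per-epoch loss history.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, size=(n_features, latent_dim))
    b_enc = np.zeros(latent_dim)
    W_dec = rng.normal(0.0, 0.1, size=(latent_dim, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W_enc + b_enc)       # latent representation
        X_hat = Z @ W_dec + b_dec            # reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_W_dec = Z.T @ g_out
        g_b_dec = g_out.sum(axis=0)
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)   # tanh derivative
        g_W_enc = X.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec

    def encode(A):
        return np.tanh(A @ W_enc + b_enc)

    return encode, losses


# Synthetic stand-in for an omics matrix (not real data): 60 samples,
# 30 features, of which 20 are driven by 3 hidden factors.
rng = np.random.default_rng(42)
factors = rng.normal(size=(60, 3))
signal = factors @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(60, 20))
noise = 0.05 * rng.normal(size=(60, 10))     # low-variance columns
X_raw = np.hstack([signal, noise])

keep = select_top_variance(X_raw, k=20)
X = X_raw[:, keep]
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize before training

encode, losses = train_autoencoder(X, latent_dim=3)
Z = encode(X)                                # latent features for a classifier
print(Z.shape, losses[0], losses[-1])
```

In a pipeline like the one the abstract outlines, the latent matrix `Z` would be the input to the downstream classifier, and a generative model would be trained on the minority class to rebalance the training set before classification.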