A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis

IF 3.9 | Zone 4, Biology | Q1 GENETICS & HEREDITY

Functional & Integrative Genomics Pub Date : 2024-08-19 DOI:10.1007/s10142-024-01415-x

Kasmika Borah, Himanish Shekhar Das, Soumita Seth, Koushik Mallick, Zubair Rahaman, Saurav Mallik



Abstract

Recent advances in biomedical technologies and the proliferation of high-dimensional Next Generation Sequencing (NGS) datasets have led to significant growth in the volume and density of data. High-dimensional NGS data, characterized by a large number of genomic, transcriptomic, proteomic, and metagenomic features relative to the number of biological samples, poses significant challenges for data analysis, including increased computational burden, potential overfitting, and difficulty in interpreting results. Feature selection and feature extraction are two pivotal techniques for addressing these challenges: by reducing the dimensionality of the data, they enhance model performance, interpretability, and computational efficiency. Both can be categorized into statistical and machine learning methods. The present study conducts a comprehensive and comparative review of statistical, machine learning, and deep learning-based feature selection and extraction techniques tailored to the interpretation of human NGS and microarray data. A thorough literature search was performed to gather information on these techniques, focusing on array-based and NGS data analysis. The techniques surveyed here, including deep learning architectures, machine learning algorithms, and statistical methods, have been explored for microarray, bulk RNA-Seq, and single-cell RNA-Seq (scRNA-Seq) datasets. The study provides an overview of these techniques, highlighting their applications, advantages, and limitations in the context of high-dimensional NGS data. This review is intended to help readers apply feature selection and feature extraction techniques to enhance the performance of predictive models, uncover underlying biological patterns, and gain deeper insights into massive and complex NGS and microarray data.
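The distinction the abstract draws between feature selection (keeping a subset of the original features) and feature extraction (deriving new features as combinations of the originals) can be illustrated with a small sketch. The example below is not taken from the review: the toy expression matrix, the function names `select_top_variance` and `first_principal_component`, and the choice of a variance filter and a power-iteration principal component are all assumptions made for illustration, using only the Python standard library.

```python
import random
from statistics import pvariance


def select_top_variance(X, k):
    """Feature selection: keep the k original columns with the highest variance."""
    n_features = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])  # indices of the retained original features
    return keep, [[row[j] for j in keep] for row in X]


def first_principal_component(X, n_iter=200, seed=0):
    """Feature extraction: one new feature per sample, obtained by projecting
    the centered data onto its leading principal axis (power iteration)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]

    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(p)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]

    for _ in range(n_iter):
        # Multiply v by Xc^T Xc without forming the p x p matrix explicitly.
        proj = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
        w = [sum(Xc[i][j] * proj[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # data has no variance at all
            break
        v = [x / norm for x in w]

    scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
    return v, scores


# Hypothetical toy matrix: 4 samples (rows) x 6 features (columns),
# i.e. more features than samples, as is typical for NGS data.
X = [
    [5.0, 0.1, 9.0, 1.0, 2.0, 0.0],
    [5.1, 0.2, 1.0, 1.1, 8.0, 0.0],
    [4.9, 0.1, 8.5, 0.9, 2.5, 0.0],
    [5.0, 0.2, 1.5, 1.0, 7.5, 0.0],
]

keep, reduced = select_top_variance(X, k=2)
axis, scores = first_principal_component(X)

print("selected feature indices:", keep)
print("selected submatrix:", reduced)
print("principal axis loadings:", [round(a, 3) for a in axis])
print("extracted feature (PC1 scores):", [round(s, 3) for s in scores])
```

In this sketch the selected columns remain directly interpretable as original features, whereas the extracted score mixes all columns through the axis loadings. In practice an established library would usually be preferred over hand-written routines like these.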




Source journal

Functional & Integrative Genomics (Biology: Genetics)

CiteScore

3.50

Self-citation rate

3.40%

Articles published

Review time

2 months

About the journal: Functional & Integrative Genomics is devoted to large-scale studies of genomes and their functions, including systems analyses of biological processes. The journal will provide the research community an integrated platform where researchers can share, review and discuss their findings on important biological questions that will ultimately enable us to answer the fundamental question: How do genomes work?