Current State-of-the-Art of Clustering Methods for Gene Expression Data with RNA-Seq

Pattern Recognition [Working Title] · Pub Date: 2020-12-23 · DOI: 10.5772/intechopen.94069

Ismail Jamail, A. Moussa


Citations: 2

Abstract

Recent developments in high-throughput cDNA sequencing (RNA-seq) have revolutionized gene expression profiling. Such profiling compares the expression levels of multiple genes between two or more samples, under specific circumstances or in a specific cell, to give a global picture of cellular function. Thanks to these advances, gene expression data are being generated at high throughput. One of the primary data analysis tasks in gene expression studies involves data-mining techniques such as clustering and classification. Clustering, an unsupervised learning technique, has been widely used as a computational tool to facilitate our understanding of gene functions and the regulation involved in a biological process. Cluster analysis aims to group the large number of genes present in gene expression profile data so that similar or related genes fall in the same cluster and different or unrelated genes fall in distinct ones. Classification, on the other hand, can be used to group samples based on their expression profiles. Many clustering and classification algorithms can be applied in gene expression experiments; the most widely used are hierarchical clustering, k-means clustering, and model-based clustering, which relies on a statistical model to determine the number of clusters. The clustering method must be chosen to fit the structure of the data. In this chapter, we present the state of the art of clustering algorithms and statistical approaches for grouping similar gene expression profiles that can be applied to RNA-seq data analysis, along with software tools dedicated to these methods. In addition, we discuss challenges in cluster analysis and compare the performance of eight commonly used clustering methods on four public datasets from recount2.
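
As a rough illustration of what grouping genes by expression profile means in practice, the sketch below runs a small k-means on a toy count matrix using only the Python standard library. It is not the chapter's benchmark: the gene names and counts are hypothetical, and the log2(count + 1) transform, per-gene centering, and farthest-first initialisation are assumptions made for the example.

```python
import math

# Hypothetical toy counts: rows are genes, columns are four samples
# (two "control" followed by two "treated"). Not data from the chapter.
counts = {
    "geneA": [10, 12, 100, 110],
    "geneB": [20, 18, 210, 190],
    "geneC": [300, 280, 30, 25],
    "geneD": [150, 160, 14, 16],
}

def log_center(row):
    """Apply log2(count + 1), then subtract the gene's mean across samples."""
    logged = [math.log2(c + 1) for c in row]
    mean = sum(logged) / len(logged)
    return [v - mean for v in logged]

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, n_iter=20):
    """Plain k-means with a deterministic farthest-first initialisation."""
    names = sorted(profiles)
    centroids = [profiles[names[0]]]
    while len(centroids) < k:
        # Pick the gene farthest from its nearest existing centroid.
        far = max(names, key=lambda n: min(sq_dist(profiles[n], c) for c in centroids))
        centroids.append(profiles[far])
    labels = {}
    for _ in range(n_iter):
        # Assignment step: each gene goes to its closest centroid.
        new_labels = {
            n: min(range(k), key=lambda j: sq_dist(profiles[n], centroids[j]))
            for n in names
        }
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [profiles[n] for n in names if labels[n] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

profiles = {gene: log_center(row) for gene, row in counts.items()}
labels = kmeans(profiles, k=2)
for gene in sorted(labels):
    print(gene, "-> cluster", labels[gene])
```

With these toy values, the two genes that rise in the treated samples end up in one cluster and the two that fall end up in the other. Real RNA-seq analyses would typically rely on dedicated normalisation and established clustering software rather than a hand-rolled loop like this one.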
