Next-generation sequencing revolution through big data analytics

Q1 Biochemistry, Genetics and Molecular Biology

Frontiers in Life Science Pub Date : 2016-04-02 DOI:10.1080/21553769.2016.1178180

R. Tripathi, Pawan Sharma, P. Chakraborty, P. Varadwaj

Frontiers in Life Science, volume 9(1), pages 119-149. Journal Article. https://doi.org/10.1080/21553769.2016.1178180

Citations: 34

Abstract

Next-generation sequencing (NGS) technology has led to an unrivaled explosion in the amount of genomic data, and this escalation has in turn raised the challenges of sharing, archiving, integrating and analyzing these data. The scale and efficiency of NGS have posed a challenge for the analysis of these vast genomic data, gene interactions, annotations and expression studies. However, this limitation of NGS can be overcome by tools and algorithms built on a big data framework. Here we review the current state of knowledge of big data algorithms for NGS that reveal hidden patterns in sequencing, analysis, annotation and related tasks. The Apache Hadoop framework gives an on-demand and adaptable environment for large-scale data analysis. It has several components for partitioning large-scale data onto clusters of commodity hardware in a fault-tolerant manner. Packages such as MapReduce, Cloudburst, Crossbow, Myrna, Eoulsan, DistMap, Seal and Contrail perform various NGS applications, such as adapter trimming, quality checking, read mapping, de novo assembly, quantification, expression analysis, variant analysis and annotation. This review deals with the current applications of Hadoop technology, with their usage and limitations from the perspective of NGS.
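As a rough illustration of the map, shuffle and reduce pattern that Hadoop-style tools apply to sequencing reads, the sketch below counts k-mers in plain Python. It is not taken from the reviewed paper or from any of the packages named above; the function names (`map_read`, `shuffle`, `reduce_counts`, `count_kmers`), the choice of k-mer counting as the task, and the toy reads are all assumptions made for the example, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key.

    On a real cluster the framework does this between map and reduce;
    here it is simulated in memory (an assumption of this sketch).
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each k-mer."""
    return {kmer: sum(values) for kmer, values in grouped.items()}


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce over a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "CGTACG"]  # hypothetical reads, not real data
    print(count_kmers(toy_reads))
```

Because each read is mapped independently and each k-mer is reduced independently, both steps could in principle be spread across many machines, which is the property the abstract attributes to the Hadoop framework.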

Source journal

Frontiers in Life Science (Multidisciplinary Sciences)

CiteScore: 5.50

Self-citation rate: 0.00%

About the journal: Frontiers in Life Science publishes high-quality and innovative research at the frontier of biology, with an emphasis on interdisciplinary research. We particularly encourage manuscripts that lie at the interface of the life sciences and either the more quantitative sciences (including chemistry, physics, mathematics, and informatics) or the social sciences (philosophy, anthropology, sociology and epistemology). We believe that these various disciplines can all contribute to biological research and provide original insights into the most recurrent questions.