Interactive Exploration on Large Genomic Datasets.

EECS technical report series Pub Date : 2016-01-01 Epub Date: 2016-05-16

Eric Tu

Abstract

The prevalence of large genomics datasets has made the need to explore this data more important. Large sequencing projects like the 1000 Genomes Project [1], which reconstructed the genomes of 2,504 individuals sampled from 26 populations, have produced over 200 TB of publicly available data. Meanwhile, existing genomic visualization tools have not scaled with the growing volume of larger, more complex data. This difficulty is acute when viewing large regions (over 1 megabase, or 1,000,000 bases of DNA) or when viewing multiple samples concurrently. While genomic processing pipelines have shifted towards distributed computing techniques, as with ADAM [4], genomic visualization tools have not. In this work we present Mango, a scalable genome browser built on top of ADAM that can run both locally and on a cluster. Mango combines several optimizations in a single application to drive novel genomic visualization techniques over terabytes of genomic data. By building visualization on top of a distributed processing pipeline, we can perform visualization queries over large regions that are not possible with current tools, and reduce the time needed to view large datasets. Mango is part of the Big Data Genomics project at the University of California, Berkeley [25] and is published under the Apache 2 license. Mango is available at https://github.com/bigdatagenomics/mango.
