Rethinking Data-Intensive Science Using Scalable Analytics Systems

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Pub Date: 2015-05-27. DOI: 10.1145/2723372.2742787

Frank A. Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, M. Linderman, M. Franklin, A. Joseph, D. Patterson


Citations: 99

Abstract

"Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines. In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity "big data" systems. To demonstrate the generality of our architecture, we then implement a scalable astronomy image processing system which achieves a 2.8x to 8.9x improvement over the state-of-the-art MPI-based system.



