Making sense of big data with the Berkeley data analytics stack

Proceedings of the 25th International Conference on Scientific and Statistical Database Management. Pub Date: 2013-07-29. DOI: 10.1145/2484838.2484884

M. Franklin


Citations: 22

Abstract

The Berkeley AMPLab was founded on the idea that the challenges of emerging Big Data applications require a new approach to analytics systems. Launched in early 2011, the project set out to rethink the traditional analytics stack, breaking down technical and intellectual barriers that had arisen during decades of evolutionary development. The vision of the lab is to seamlessly integrate the three main resources available for making sense of data at scale: Algorithms (such as machine learning and statistical techniques), Machines (in the form of scalable clusters and elastic cloud computing), and People (both individually as analysts and en masse, as with crowd-sourced human computation). To pursue this goal, we assembled a research team with diverse interests across computer science, forged relationships with domain experts on campus and elsewhere, and obtained the support of leading industry partners and major government sponsors. The lab is realizing its ideas through the development of a freely available open-source software stack called BDAS: the Berkeley Data Analytics Stack. In the nearly three years the lab has been in operation, we've released major components of BDAS. Several of these components have gained significant traction in industry and elsewhere: the Mesos cluster resource manager, the Spark in-memory computation framework, and the Shark query processing system. In this talk, I'll describe the current state of BDAS with an emphasis on the key components that have been released to date. I'll then discuss ongoing efforts on machine learning scalability and ease of use, including the MLbase system, as our focus moves higher up the stack. Finally, I will present our longer-term views of how all the pieces will fit together to form a system that can adaptively bring the right resources to bear on a given data-driven question to meet time, cost, and quality requirements throughout the analytics lifecycle.

