Mosaics in Big Data: Stratosphere, Apache Flink, and Beyond

Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems, Pub Date: 2018-06-25, DOI: 10.1145/3210284.3214344

V. Markl


Citations: 3

Abstract




The global database research community has greatly impacted the functionality and performance of data storage and processing systems along the dimensions that define "big data", i.e., volume, velocity, variety, and veracity. Locally, over the past five years, we have also been working on various fronts. Among our contributions are: (1) establishing a vision for a database-inspired big data analytics system, which unifies the best of database and distributed systems technologies and augments them with concepts drawn from compilers (e.g., iterations) and data stream processing, and (2) forming a community of researchers and institutions to create the Stratosphere platform and realize that vision. One major result of these activities was Apache Flink, an open-source big data analytics platform with a thriving global community of developers and production users. Although much progress has been made, a major challenge for the database research community remains when looking at the overall big data stack: how to maintain ease of use despite the increasing heterogeneity and complexity of data analytics. An end-to-end data analytics pipeline now involves specialized engines for its various aspects, including, among others, graph-based, linear algebra-based, and relational algorithms, and it runs on increasingly heterogeneous hardware and computing infrastructure. At TU Berlin, DFKI, and the Berlin Big Data Center (BBDC), we aim to advance research in this field via the Mosaics project. Our goal is to remedy some of the heterogeneity challenges that hamper developer productivity and limit the use of data science technologies to a privileged few coveted experts.

