DiNoDB: Efficient Large-Scale Raw Data Analytics

Data4U '14  Pub Date: 2014-09-01  DOI: 10.1145/2658840.2658841

Yongchao Tian, Ioannis Alagiannis, Erietta Liarou, A. Ailamaki, P. Michiardi, M. Vukolic


Citations: 10

Abstract

Modern big data workflows, such as those found in machine learning use cases, often involve iterated cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared-nothing RDBMSs) require a data loading and/or transformation phase, which is inherently expensive for temporary data. In this paper, we propose a novel scalable distributed solution for in-situ data analytics that offers both batch and interactive analytics on raw data, hence avoiding the loading-phase bottleneck of RDBMSs. Our system combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries on raw files. We revisit NoDB's centralized design and scale it out to support multiple clients and data processing nodes, producing a new distributed data analytics system we call Distributed NoDB (DiNoDB). DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) that speed up interactive queries without the overheads of the data loading and data movement phases, allowing users to quickly and efficiently exploit their data. Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency with respect to comparable state-of-the-art distributed query engines, such as Shark, Hive, and HadoopDB.
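To make the idea of a positional map concrete, the following is a minimal single-machine sketch, not DiNoDB's actual implementation or file format (neither is described in the abstract). It assumes a simple comma-delimited raw file without quoted fields; the function names `build_positional_map` and `read_field` are hypothetical. The map records the byte offset at which each field starts, so a later query can jump straight to one attribute instead of re-tokenizing every row.

```python
from typing import List


def build_positional_map(raw: bytes, delimiter: bytes = b",") -> List[List[int]]:
    """Return, for every row, the absolute byte offset where each field starts.

    Assumption: fields are separated by a single-byte delimiter and are never quoted.
    """
    pmap: List[List[int]] = []
    row_start = 0
    for line in raw.splitlines(keepends=True):
        offsets = [row_start]  # the first field starts where the row starts
        pos = line.find(delimiter)
        while pos != -1:
            offsets.append(row_start + pos + 1)  # next field starts after the delimiter
            pos = line.find(delimiter, pos + 1)
        pmap.append(offsets)
        row_start += len(line)
    return pmap


def read_field(raw: bytes, pmap: List[List[int]], row: int, col: int,
               delimiter: bytes = b",") -> bytes:
    """Fetch one field by seeking to its recorded offset in the raw bytes."""
    start = pmap[row][col]
    end = start
    # Scan forward only until the end of this single field.
    while end < len(raw) and raw[end:end + 1] not in (delimiter, b"\n", b"\r"):
        end += 1
    return raw[start:end]


raw_data = b"id,name,score\n1,ann,9.5\n2,bob,7.0\n"
positional_map = build_positional_map(raw_data)
score_of_ann = read_field(raw_data, positional_map, row=1, col=2)
```

In this sketch the map is built in one pass over the raw bytes, which loosely mirrors the abstract's description of producing such metadata as a by-product of a batch job; how DiNoDB distributes and stores these maps across nodes is not specified here.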



