A NoSQL Data Model for Scalable Big Data Workflow Execution

2016 IEEE International Congress on Big Data (BigData Congress). Pub Date: 2016-06-01. DOI: 10.1109/BigDataCongress.2016.15

Aravind Mohan, M. Ebrahimi, Shiyong Lu, Alexander Kotov


Citations: 13

Abstract

While big data workflows have recently been proposed as the next-generation data-centric workflow paradigm for processing and analyzing data of ever-increasing scale, complexity, and rate of acquisition, a scalable distributed data model that abstracts and automates data distribution, parallelism, and scalable processing is still missing. Meanwhile, although NoSQL has emerged as a new category of data models, these models are optimized for storing and querying large datasets, not for ad hoc data analysis, where data placement and data movement are necessary for optimized workflow execution. In this paper, we propose a NoSQL data model that: 1) supports high-performance MapReduce-style workflows that automate data partitioning and data-parallel execution (in contrast to the traditional MapReduce framework, our MapReduce-style workflows are fully composable with other workflows, enabling dataflow applications with a richer structure); 2) automates virtual machine provisioning and deprovisioning on demand according to the sizes of input datasets; and 3) enables a flexible framework for workflow executors that take advantage of the proposed NoSQL data model to improve the performance of workflow execution. Our case studies and experiments show the competitive advantages of the proposed data model. The proposed NoSQL data model is implemented in a new release of DATAVIEW, one of the most usable big data workflow systems in the community.
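To make the ideas in the abstract concrete, the following is a minimal, illustrative sketch of a MapReduce-style step that partitions its input automatically and sizes its worker pool from the input size. It is not the DATAVIEW implementation or its API: every name here (`plan_workers`, `partition`, `run_workflow`, `RECORDS_PER_WORKER`) is hypothetical, the sizing rule is an assumption, and local threads merely stand in for provisioned virtual machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from math import ceil

# Hypothetical sizing rule (an assumption, not taken from the paper).
RECORDS_PER_WORKER = 4


def plan_workers(num_records, records_per_worker=RECORDS_PER_WORKER, max_workers=8):
    """Choose a worker count from the input size (stand-in for VM provisioning)."""
    return max(1, min(max_workers, ceil(num_records / records_per_worker)))


def partition(records, num_parts):
    """Split records into at most num_parts contiguous, nearly equal chunks."""
    if not records:
        return []
    size = ceil(len(records) / num_parts)
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_task(chunk):
    """Map phase: count words within one partition."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def reduce_task(left, right):
    """Reduce phase: merge two partial word counts."""
    return left + right


def run_workflow(records):
    """Partition the input, run map tasks in parallel, then reduce the results."""
    workers = plan_workers(len(records))
    chunks = partition(records, workers)
    # The pool is created for this run and released when the block exits,
    # loosely mirroring on-demand provisioning and deprovisioning.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, chunks))
    return reduce(reduce_task, partials, Counter())


if __name__ == "__main__":
    print(run_workflow(["a b a", "b c", "a", "c c", "b"]))
```

Because `run_workflow` returns an ordinary value, its result can be passed to another workflow step, which is a rough analogue of the composability the abstract describes.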


