RStore: efficient multiversion document management in the cloud

Proceedings of the 2017 Symposium on Cloud Computing. Pub Date: 2017-09-24. DOI: 10.1145/3127479.3132693

Souvik Bhattacherjee, A. Deshpande


Citation count: 1

Abstract

Motivation. The iterative and exploratory nature of the data science process, combined with an increasing need to support debugging, historical queries, auditing, provenance, and reproducibility, creates a need to store and query a large number of versions of a dataset. This realization has led to many efforts to build data management systems that support versioning as a first-class construct, both in academia [1, 3, 5, 6] and in industry (e.g., git, Datomic, noms). These systems typically support rich versioning/branching functionality and complex queries over versioned information, but they lack the capability to host versions of a collection of keyed records or documents in a distributed environment or a cloud. Alternatively, key-value stores (e.g., Apache Cassandra, HBase, MongoDB) are appealing in many collaborative scenarios spanning geographically distributed teams, since they offer centralized hosting of the data, are resilient to failures, can easily scale out, and can handle a large number of queries efficiently. However, they do not offer rich versioning and branching functionality akin to hosted version control systems (VCS) like GitHub. This work addresses the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over them.

RStore Overview. Our primary focus here is to provide versioning and branching support for collections of records with unique identifiers. Like popular NoSQL systems, RStore supports a flexible data model; records of varying sizes, ranging from a few bytes to a few MBs; and a variety of retrieval queries to cover a wide range of use cases. Specifically, similar to NoSQL systems, our system supports efficient retrieval of a specific record in a specific version (given a key and a version identifier), or of the entire evolution history for a given key. Similar to VCS, it supports retrieving all records belonging to a specific version, to support use cases that require updating a large number of records (e.g., by applying a data cleaning step). Finally, since retrieving an entire version might be unnecessary and expensive, our system supports partial version retrieval given a range of keys and a version identifier.

Challenges. Addressing the above desiderata poses many design and computational challenges, and natural baseline approaches (see full paper [2] for more details) that attempt to build this functionality on top of existing key-value stores suffer from critical limitations. First, most of those baseline approaches cannot directly support point queries targeting a specific record in a specific version (and, by extension, full or partial version retrieval queries) without constructing and maintaining explicit indexes. Second, all the viable baselines fundamentally require too many round trips between the retrieval module and the backend key-value store, because the desired set of records cannot be succinctly described as a query. Third, ingesting new versions is difficult for most of the baseline approaches. Finally, exploiting "record-level compression" is difficult or impossible in those approaches; this is crucial for handling common use cases where large records (e.g., documents) are updated frequently with relatively small changes.

Key Ideas. To address these problems, RStore features a new architecture that partitions the distinct records into approximately equal-sized "chunks", with the goal of minimizing the number of chunks that need to be retrieved for a given query workload [2]. We establish that the system can adapt to different data and workload requirements through a few simple tuning knobs. The key computational challenge boils down to deciding how to optimally partition the records into chunks; we draw connections to well-studied problems like compressing bipartite graphs and hypergraph partitioning to show that the problem is NP-hard in general. Our system, built on top of Apache Cassandra, features a novel algorithm that exploits the structure of the version graph to find an effective partitioning of the records. An extensive experimental evaluation is performed over a large number of synthetically constructed datasets to show the effectiveness of RStore and to validate our design decisions.
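The query types and the chunking objective described above can be illustrated with a small in-memory sketch. This is only a toy model under stated assumptions, not RStore's API or its partitioning algorithm: the names `VERSIONS`, `point_query`, `key_history`, `version_records`, `chunks_fetched`, and `total_cost`, the record ids such as `a#1`, and both hand-written partitions are hypothetical. It shows how the same set of distinct records, split into equal-sized chunks in two different ways, leads to a different number of chunks fetched when whole versions are retrieved.

```python
# Toy in-memory model; every name here is hypothetical, not RStore's API.
# Each version maps key -> record id. A record id names one distinct record,
# which may be shared by several versions.
VERSIONS = {
    "v1": {"a": "a#1", "b": "b#1", "c": "c#1"},
    "v2": {"a": "a#2", "b": "b#1", "c": "c#1"},              # only "a" changed
    "v3": {"a": "a#2", "b": "b#2", "c": "c#1", "d": "d#1"},  # "b" changed, "d" added
}


def point_query(key, version):
    """Record id of one key in one version (None if the key is absent)."""
    return VERSIONS[version].get(key)


def key_history(key):
    """Distinct record ids a key has taken, in first-seen version order."""
    history = []
    for records in VERSIONS.values():
        rid = records.get(key)
        if rid is not None and rid not in history:
            history.append(rid)
    return history


def version_records(version, lo=None, hi=None):
    """Full version, or a partial version restricted to keys in [lo, hi]."""
    return {
        k: rid
        for k, rid in sorted(VERSIONS[version].items())
        if (lo is None or k >= lo) and (hi is None or k <= hi)
    }


def chunks_fetched(partition, version, lo=None, hi=None):
    """Number of distinct chunks touched by a full or partial version query."""
    wanted = version_records(version, lo, hi).values()
    return len({partition[rid] for rid in wanted})


def total_cost(partition):
    """Chunks fetched when every version is retrieved once in full."""
    return sum(chunks_fetched(partition, v) for v in VERSIONS)


# Two hand-written partitions of the six distinct records into chunks of two.
BY_KEY = {"a#1": 0, "a#2": 0, "b#1": 1, "b#2": 1, "c#1": 2, "d#1": 2}
BY_VERSION = {"a#1": 0, "b#1": 0, "a#2": 1, "c#1": 1, "b#2": 2, "d#1": 2}

if __name__ == "__main__":
    print(point_query("a", "v2"))            # a#2
    print(key_history("a"))                  # ['a#1', 'a#2']
    print(version_records("v3", "b", "c"))   # {'b': 'b#2', 'c': 'c#1'}
    print(total_cost(BY_KEY), total_cost(BY_VERSION))  # 9 6
```

In this toy data, grouping records that co-occur in versions (`BY_VERSION`) lowers the cost of full-version retrieval from 9 to 6 chunk fetches, while grouping by key (`BY_KEY`) keeps the two records in the history of key `a` in one chunk instead of two. That trade-off is one way to read the abstract's point that the best partitioning depends on the query workload.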

