Distributed Evaluation of XPath Axes Queries over Large XML Documents Stored in MapReduce Clusters

2014 25th International Workshop on Database and Expert Systems Applications. Pub Date: 2014-09-07. DOI: 10.1109/DEXA.2014.59

Adam Senk, M. Valenta, W. Benn

Citations: 2

Abstract

The MR (MapReduce) framework, a programming model for parallel computation over data stored in a cluster of commodity computers, has established itself as one of the leading solutions for Big Data processing. The framework is also used as a query language in many database systems, because it can process data stored in various unstructured, semi-structured, and structured formats. Although the MR framework can also be used for XML data processing, it does not allow queries to be written declaratively, as they are in XPath or XQuery. To overcome this problem, we propose a system that lets users query XML data with XPath while evaluating the queries in parallel using the MR framework. First, we introduce a persistent storage layer that maps XML data into a wide-column store. The proposed mapping enables efficient, distributed data processing. Second, we describe a query processor that translates a subset of the XPath language into MR jobs. Finally, we present tests and their results, which show the scalability of our system.
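The abstract does not spell out the storage schema or the translation rules, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical row layout (one row per element, keyed by a Dewey-style id, with `tag`, `parent`, and `text` columns) and simulates a single MR job in-process for one child-axis step. All function names and the layout are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def shred(xml_text):
    """Flatten an XML document into rows keyed by a Dewey-style id (assumed layout)."""
    rows = {}

    def walk(elem, node_id, parent_id):
        rows[node_id] = {
            "tag": elem.tag,
            "parent": parent_id,
            "text": (elem.text or "").strip(),
        }
        for pos, child in enumerate(elem, start=1):
            walk(child, f"{node_id}.{pos}", node_id)

    walk(ET.fromstring(xml_text), "1", None)
    return rows


def map_child_step(row_id, row, name_test):
    """Map phase: emit (parent id, node id) for every row that passes the name test."""
    if row["parent"] is not None and row["tag"] == name_test:
        yield row["parent"], row_id


def reduce_child_step(parent_id, child_ids, context):
    """Reduce phase: keep children whose parent is in the current context set."""
    if parent_id in context:
        yield from sorted(child_ids)


def run_child_step(rows, context, name_test):
    """In-process stand-in for one MR job: map, group by key, reduce."""
    groups = defaultdict(list)
    for row_id, row in rows.items():
        for key, value in map_child_step(row_id, row, name_test):
            groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_child_step(key, groups[key], context))
    return result


DOC = "<lib><book><title>A</title></book><book><title>B</title></book><cd/></lib>"
rows = shred(DOC)
# Evaluate /lib/book/title as two chained child-axis steps, starting from the root row "1".
books = run_child_step(rows, {"1"}, "book")
titles = run_child_step(rows, set(books), "title")
print([rows[t]["text"] for t in titles])
```

In this sketch each location step becomes one job whose output ids form the context set of the next step. A real deployment would read the rows from the wide-column store and let the cluster perform the grouping; the sketch also sorts ids as plain strings, which is enough for this small example but is not a general document-order comparison.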

