Towards optimization of RDF analytical queries on MapReduce

2014 IEEE 30th International Conference on Data Engineering Workshops
Pub Date: 2014-03-01
DOI: 10.1109/ICDEW.2014.6818351

P. Ravindra

Citations: 7

Abstract

The broadened use of Semantic Web technologies across domains has shifted the focus from simple pattern-matching queries on RDF data to analytical queries with complex grouping and aggregations. An RDF analytical query involves graph pattern matching, which translates to several join operations because of the fine-grained nature of the RDF data model. Complex analytical queries involve multiple grouping-aggregations on different graph patterns, making such tasks join-intensive. Scale-out processing of RDF analytical queries on existing relational-style MapReduce platforms such as Apache Hive and Pig results in lengthy execution workflows with multiple cycles of I/O and network transfer. Additionally, certain graph patterns produce avoidable redundancy in intermediate results, which increases processing costs. The PhD thesis summarized in this paper proposes a two-pronged approach to minimizing the cost of processing RDF queries on MapReduce: an algebraic approach, based on a Nested TripleGroup Data Model and Algebra, that reinterprets graph pattern queries in a way that reduces the required number of map-reduce cycles; and special strategies that minimize the redundancy in intermediate data while processing certain graph patterns. The proposed techniques are integrated into Apache Pig. Empirical evaluation of this work on graph pattern queries shows 45-60% performance gains over systems such as Pig and Hive.
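
To illustrate why graph pattern matching over RDF is join-intensive, the sketch below matches a three-property star pattern in two ways: with one join per additional property (the relational style), and with a single group-by-subject pass. It is only an illustration and not the thesis's implementation. The toy triples, the function names, and the assumption that a "triple group" is the set of triples sharing a subject are all introduced here for the example; properties are also assumed to be single-valued.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, property, object) triples.
TRIPLES = [
    ("p1", "label", "Widget"),
    ("p1", "price", 10),
    ("p1", "vendor", "v1"),
    ("p2", "label", "Gadget"),
    ("p2", "price", 25),
    ("p3", "label", "Gizmo"),
    ("p3", "price", 10),
    ("p3", "vendor", "v2"),
]


def match_star_by_joins(triples, properties):
    """Relational style: one selection per property, then pairwise joins on subject."""
    # One subject -> object relation per property (assumes single-valued properties).
    relations = [{s: o for s, p, o in triples if p == prop} for prop in properties]
    result = {s: (o,) for s, o in relations[0].items()}
    joins = 0
    for rel in relations[1:]:
        # Each extra property costs one more join on the subject column.
        result = {s: vals + (rel[s],) for s, vals in result.items() if s in rel}
        joins += 1
    return result, joins


def match_star_by_grouping(triples, properties):
    """Grouping style: group triples by subject once, then keep complete groups."""
    groups = defaultdict(dict)
    for s, p, o in triples:
        groups[s][p] = o
    return {
        s: tuple(g[p] for p in properties)
        for s, g in groups.items()
        if all(p in g for p in properties)
    }


star = ["label", "price", "vendor"]
joined, join_count = match_star_by_joins(TRIPLES, star)
grouped = match_star_by_grouping(TRIPLES, star)
print(join_count, joined == grouped)
```

Both functions return the same matches for this data. The join-based version needs one join per additional property, and on a relational-style MapReduce platform each such join may become its own cycle, whereas the grouping version needs a single pass keyed on the subject.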
