Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce

Hyeongsik Kim, P. Ravindra, Kemafor Anyanwu
Published in: 2012 IEEE Fifth International Conference on Cloud Computing (publication date: 2012-06-24)
DOI: 10.1109/CLOUD.2012.14 (https://doi.org/10.1109/CLOUD.2012.14)
Citations: 15

Abstract

Recently, the number and size of RDF data collections have increased rapidly, making scalable processing techniques crucial. The MapReduce model has become a de facto standard for large-scale data processing using a cluster of machines in the cloud. Generally, RDF query processing creates join-intensive workloads, resulting in lengthy MapReduce workflows with expensive I/O, data transfer, and sorting costs. However, the MapReduce computation model offers only limited support for the static optimization techniques used in relational databases (e.g., indexing and cost-based optimization). Consequently, dynamic optimization techniques for such join-intensive tasks on MapReduce need to be investigated. In previous work, we proposed a Nested Triple Group data model and Algebra (NTGA) for efficient graph pattern query processing in the cloud. Here, we extend this work with a scan-sharing technique that optimizes the processing of graph patterns with repeated properties. Specifically, our scan-sharing technique eliminates the need for repeated scanning of input relations when properties are used repeatedly in graph patterns. We discuss a formal foundation underlying this scan-sharing technique and present an implementation strategy that has been integrated into the Apache Pig framework. We also present a comprehensive evaluation demonstrating the performance benefits of our NTGA plus scan-sharing approach.
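
The idea of avoiding repeated scans for a repeated property can be illustrated with a small in-memory sketch. This is not the NTGA operators or the Apache Pig integration described in the abstract; the triples, the function names `scan_per_pattern` and `shared_scan`, and the two-pattern query are all hypothetical and chosen only to show the contrast between one scan per triple pattern and a single shared scan.

```python
from collections import defaultdict

# Toy triple relation: (subject, property, object). All values are made up.
TRIPLES = [
    ("paper1", "cites", "paper2"),
    ("paper2", "cites", "paper3"),
    ("paper1", "author", "kim"),
    ("paper3", "author", "lee"),
]

# Properties of a graph pattern that repeats one property:
#   ?a cites ?b . ?b cites ?c
PATTERN_PROPERTIES = ["cites", "cites"]


def scan_per_pattern(triples, pattern_properties):
    """Baseline: one full pass over the input for every triple pattern."""
    scans = 0
    matches = []
    for prop in pattern_properties:
        scans += 1
        matches.append([t for t in triples if t[1] == prop])
    return matches, scans


def shared_scan(triples, pattern_properties):
    """Single pass: route each triple to every pattern whose property it matches."""
    # Map each property to the positions of the patterns that use it.
    patterns_by_prop = defaultdict(list)
    for idx, prop in enumerate(pattern_properties):
        patterns_by_prop[prop].append(idx)

    matches = [[] for _ in pattern_properties]
    for t in triples:  # the only pass over the input
        for idx in patterns_by_prop.get(t[1], ()):
            matches[idx].append(t)
    return matches, 1


baseline_matches, baseline_scans = scan_per_pattern(TRIPLES, PATTERN_PROPERTIES)
shared_matches, shared_scans = shared_scan(TRIPLES, PATTERN_PROPERTIES)
print(baseline_scans, shared_scans, baseline_matches == shared_matches)
```

Both functions produce the same per-pattern candidate triples; the difference is only in how many times the input is read. In a MapReduce setting each of those reads would be a pass over data on disk, which is the cost the abstract says the technique removes.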