Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools

2010 IEEE 3rd International Conference on Cloud Computing. Pub Date: 2010-07-05. DOI: 10.1109/CLOUD.2010.36

M. Husain, L. Khan, Murat Kantarcioglu, B. Thuraisingham


Citations: 90

Abstract

Cloud computing is the newest paradigm in the IT world and hence a focus of new research. Companies hosting cloud computing services face the challenge of handling data-intensive applications. Semantic web technologies, which have been standardized by the World Wide Web Consortium (W3C), are a strong candidate for use together with cloud computing tools to provide a solution. One such standard is the Resource Description Framework (RDF). With the explosion of semantic web technologies, large RDF graphs are commonplace, and current frameworks do not scale to them. In this paper, we describe a framework that we built using Hadoop, a popular open-source framework for cloud computing, to store and retrieve large numbers of RDF triples. We describe a scheme for storing RDF data in the Hadoop Distributed File System (HDFS). We present an algorithm that uses a cost model to generate the best possible query plan for answering a SPARQL Protocol and RDF Query Language (SPARQL) query. We use Hadoop's MapReduce framework to answer the queries. Our results show that we can store large RDF graphs in Hadoop clusters built with cheap commodity-class hardware. Furthermore, we show that our framework is scalable and efficient and, unlike traditional approaches, can easily handle billions of RDF triples.
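To make the idea of answering a query with MapReduce concrete, here is a minimal sketch in plain Python. It is not the authors' storage scheme, cost model, or plan generator, none of which are detailed in the abstract. It only simulates, in memory and without Hadoop, how a map step and a reduce step could join two triple patterns on a shared variable. The triples, the predicate names (worksFor, locatedIn), and all function names are hypothetical.

```python
from collections import defaultdict

# Toy RDF triples as (subject, predicate, object). All data is made up.
TRIPLES = [
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("carol", "worksFor", "initech"),
    ("acme", "locatedIn", "dallas"),
    ("initech", "locatedIn", "austin"),
]


def map_phase(triples):
    """Emit (join_key, tagged_value) pairs for the two patterns
    '?person worksFor ?org' and '?org locatedIn ?city', keyed on ?org."""
    for s, p, o in triples:
        if p == "worksFor":
            yield o, ("person", s)   # ?org is the object here
        elif p == "locatedIn":
            yield s, ("city", o)     # ?org is the subject here


def shuffle(pairs):
    """Group mapper output by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each ?org, pair every matching ?person with every matching ?city."""
    results = []
    for org, values in groups.items():
        people = [v for tag, v in values if tag == "person"]
        cities = [v for tag, v in values if tag == "city"]
        for person in people:
            for city in cities:
                results.append((person, org, city))
    return sorted(results)


def run_query(triples):
    """Run the single simulated map, shuffle, and reduce pass."""
    return reduce_phase(shuffle(map_phase(triples)))


if __name__ == "__main__":
    for row in run_query(TRIPLES):
        print(row)
```

A query with more join variables would generally need more than one such pass, which is the kind of choice a cost-based plan generator such as the one the abstract mentions would have to make. This sketch does not attempt that step.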
